
Proceedings of the 12th IEEE International Workshop on Rapid System Prototyping (RSP 2001), Monterey, CA, USA, 25-27 June 2001

Singular Value Decomposition on Distributed Reconfigurable Systems

Christophe Bobda and Nils Steenbock

Paderborn University, Heinz Nixdorf Institute
Fuerstenallee 11, 33102 Paderborn, Germany
bobda@upb.de, nsteenb@upb.de

Abstract

The use of FPGAs (Field Programmable Gate Arrays) in the area of rapid prototyping and reconfigurable computing has been successful in the past[1]. Although many experiments have shown FPGAs to be faster than general purpose processors and more flexible than ASICs (Application Specific Integrated Circuits) on some classes of problems, few experiments have offered a computing platform which exploits the reconfigurability of FPGAs and combines FPGAs and processors to provide better solutions for applications. The goal of this paper is to show, through an efficient implementation of the Singular Value Decomposition (SVD) of very large matrices, the possibility of integrating FPGAs as part of a Distributed Reconfigurable System (DRS). A cluster of 8 workstations with two FPGA boards was built for this purpose. The algorithm is currently running as a pure software solution, but we are working to integrate the FPGAs in the computation. First results are encouraging, showing that the performance of the new platform can be high compared to pure software solutions.

1. Introduction

The Singular Value Decomposition of a matrix has many important scientific and engineering applications[5, 2, 6]. In information retrieval, "term by document matrices"[21] are used to index document collections. Statistical or boolean methods are then used to analyse the collection in order to improve query processing. One of the successful methods to extract information from a document collection is Latent Semantic Indexing (LSI)[6, 2]. LSI is based on the assumption that there is an underlying structure which governs the use of words across a document collection and that statistical methods can be used to extract this structure. One of the methods used by LSI to analyse the term by document matrix in order to extract the underlying structure of



the document collection is the Singular Value Decomposition (SVD). Computation of the SVD of big matrices is very time consuming: it can take, for example, about 18 hours on a SUN SPARC 10 workstation for a 90,000 by 70,000 matrix[2]. With the increasing number of documents on the internet, term by document matrices are becoming very large (for example $10^6 \times 10^6$), making it impracticable to compute their SVD. Instead of computing the SVD of such a huge amount of data, Berry et al.[2] suggested computing the SVD of only a selected subspace of the document collection and folding the rest of the documents into the selected subspace. The price to pay when doing so is a loss of accuracy of the queries[2]. Ideally one would like to compute the SVD of the matrix representing the complete collection to provide the best accuracy for any query in the collection. The primary reason for us to consider a Distributed Reconfigurable System (DRS) as the platform for computing the SVD of very large term by document matrices was to drastically increase the accuracy of indexing by increasing the size of the subspace considered for the SVD. In this paper we show how our system can deal with matrices of size $10^6 \times 10^6$ in a reasonable time if the architecture is built efficiently.

Section 2 provides the definition of the SVD problem and the way a Jacobi-like solution due to Hestenes can be found. Our implementation is based on the Brent and Luk parallel distribution of the Hestenes method on an array of multiprocessors, which we show in section 2.2. We describe the new architecture with its advantages in section 3. Section 4 explains how the orthogonalisation of a block matrix, which is the kernel of the method of Hestenes, is done on a node of the DRS. We show why and how the reconfiguration happens on FPGAs and provide some analysis of the performance of the system. Finally, we give an overview of our work and the challenges for the future in section 5.


2 The Singular Value Decomposition

The Singular Value Decomposition of an $m \times n$ real matrix $A$ is its factorisation into a product:

$A = U \Sigma V^T$   (1)

where $U$ is an $(m \times n)$ matrix with orthogonal columns, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$, and $V$ is an $(n \times n)$ orthogonal matrix. The values $\sigma_i$ of the diagonal matrix $\Sigma$ are the singular values of the matrix $A$; the columns of the matrices $U$ and $V$ are the left and right singular vectors of $A$. The singular vectors are orthogonalised eigenvectors associated with eigenvalues of the symmetric matrices $A^TA$ and $AA^T$ respectively. The singular values of $A$ are the square roots of the eigenvalues of $A^TA$. Many algorithms have been developed and implemented for computing the singular value decomposition[10, 3]. Among them are the Jacobi methods. Although very effective for matrices of small dimension, they are usually slower than many other methods on sequential computers[10, 3]. Meanwhile, the inherent parallelism which characterises the Jacobi methods has made them an attractive solution on parallel computers. The singular value decomposition of a matrix $A$ can be computed by first computing the eigenvalues and eigenvectors of the symmetric matrix $A^TA$ or $AA^T$ and then deriving the singular values of $A$ from those calculated values. Building the matrix $A^TA$ or $AA^T$ for large matrices ($10^6 \times 10^6$) would lead to a big overhead in space and time. Using the one-sided orthogonalisation method of Hestenes[16], we do not need to first build the matrix $A^TA$ or $AA^T$ before generating the eigenvalue decomposition.
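As a quick numerical illustration of this definition (not part of the paper's implementation), the following NumPy sketch checks, for a small random matrix, that the singular values of $A$ are the square roots of the eigenvalues of $A^TA$; the matrix size and the use of numpy.linalg are assumptions made only for this example.

```python
# Minimal sketch (assumes NumPy is available): the singular values of A
# are the square roots of the eigenvalues of A^T A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))                    # small m x n example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * diag(s) * V^T
eigvals = np.linalg.eigvalsh(A.T @ A)              # eigenvalues of A^T A (ascending)

print(np.allclose(np.sort(s**2), eigvals))         # True: sigma_i^2 are eigenvalues of A^T A
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True: the factorisation reproduces A
```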

2.1 The Hestenes Method

Further on we will use the notation of [10]: $W_{(:,i)}$ and $W_{(j,:)}$ respectively for the $i$-th column and the $j$-th row of the matrix $W$. The Hestenes method is based on the classical one-sided Jacobi iteration for the diagonalisation of real symmetric matrices. The idea is to generate an orthogonal matrix $V$ such that the transformed matrix $AV = W$ has orthogonal columns. Having the matrix $W$, the Euclidean length of each non-zero column $W_{(:,i)}$ is normalised to unity. The singular values and vectors are then computed as follows:

$\sigma_i = \|W_{(:,i)}\|, \qquad U_{(:,i)} = \dfrac{W_{(:,i)}}{\sigma_i}, \qquad W = U\Sigma$   (2)

The SVD of the matrix A is then given by:

$AV = W \;\rightarrow\; AV = U\Sigma \;\rightarrow\; A = U\Sigma V^T$   (3)

Plane rotations represented by a matrix $Q$ are incrementally applied to the matrix $A$ to compute the matrix $W$. At the $k$-th step, with the rotation matrix $Q^{(k)}$, we have:

$A^{(k+1)} = A^{(k)} Q^{(k)}, \quad 0 \le k \le k_r$   (4)

With a sweep defined to be a series of $n(n-1)/2$ pairwise column orthogonalisations of the matrix $A^{(k)}$, the convergence of the matrix $A^{(k)}$ to $W$ is guaranteed for $k_r = S \cdot n(n-1)/2$, where $S$ is the number of sweeps, $A^{(0)} = A$ and $W = A^{(k_r)}$. If $Q^{(k)}$ is a $\theta^{(k)}_{ij}$-rotation in the plane $(i,j)$ with $i < j$, we have $Q^{(k)} = (q^{(k)}_{st})$ where:

$q^{(k)}_{ii} = q^{(k)}_{jj} = \cos\theta^{(k)}_{ij}, \qquad q^{(k)}_{ij} = -q^{(k)}_{ji} = \sin\theta^{(k)}_{ij},$
$q^{(k)}_{ss} = 1$ if $s \ne i$ and $s \ne j$, $\qquad q^{(k)}_{st} = 0$ for all other $s \ne t$   (5)

Post-multiplication of $A^{(k)}$ by $Q^{(k)}$ affects only the column pair $(A^{(k)}_{(:,i)}, A^{(k)}_{(:,j)})$. The computation of $A^{(k+1)} = A^{(k)} Q^{(k)}$ is therefore reduced to:

$A^{(k+1)}_{(:,i)} = A^{(k)}_{(:,i)} \cos\theta^{(k)}_{ij} - A^{(k)}_{(:,j)} \sin\theta^{(k)}_{ij}, \qquad A^{(k+1)}_{(:,j)} = A^{(k)}_{(:,i)} \sin\theta^{(k)}_{ij} + A^{(k)}_{(:,j)} \cos\theta^{(k)}_{ij}$   (6)

The rotation angle $\theta^{(k)}_{ij}$ is chosen in such a way that the new column pair is orthogonal. Using the formulas of Rutishauser[20], we first compute the dot products:

$\gamma^{(k)}_{ij} = A^{(k)\,T}_{(:,i)} A^{(k)}_{(:,j)}, \qquad \alpha^{(k)}_{i} = \|A^{(k)}_{(:,i)}\|^2, \qquad \beta^{(k)}_{j} = \|A^{(k)}_{(:,j)}\|^2$   (7)

We set $\theta^{(k)}_{ij} = 0$ if $\gamma^{(k)}_{ij} = 0$; otherwise we compute:

$\zeta = \dfrac{\beta^{(k)}_{j} - \alpha^{(k)}_{i}}{2\,\gamma^{(k)}_{ij}}, \qquad t = \dfrac{\mathrm{sign}(\zeta)}{|\zeta| + \sqrt{1 + \zeta^2}}$   (8)

$\cos\theta^{(k)}_{ij} = \dfrac{1}{\sqrt{1 + t^2}}, \qquad \sin\theta^{(k)}_{ij} = t \cdot \cos\theta^{(k)}_{ij}$   (9)

The rotation angle always satisfies:

$|\theta^{(k)}_{ij}| \le \dfrac{\pi}{4}$   (10)


The matrix V is updated in the same way as matrix A:

$V^{(k+1)} = V^{(k)} Q^{(k)}, \quad 0 \le k \le k_r$   (11)

with $V^{(0)} = I_n$ and $V = V^{(k_r)}$. The bulk of this method is the generation of the rotation angle (equations (7) to (9)), the updating of the column elements (equation (6)) and the test for orthogonality of two columns with the threshold method of [24]. Those three steps are repeated for each column pair of the matrix. While equations (8) and (9) need only a dozen instructions for each column pair, the number of instructions needed for equations (6) and (7) is a multiple of the column size. Those two equations are dataflow oriented. Because of the huge amount of data to be processed, it is advantageous to implement them as a hardware module: data is then streamed into the module and the result collected at the output (fig. 4). Section 4 explains how our implementation is done.
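To make the kernel concrete, the following NumPy sketch performs the orthogonalisation of one column pair of $A$ (with the corresponding update of $V$) along the lines of equations (6) to (9); the variable names and the exact form of the angle formulas are our reconstruction of the Rutishauser recipe, not code taken from the paper.

```python
import numpy as np

def orthogonalise_pair(A, V, i, j, eps=1e-12):
    """One step of the one-sided (Hestenes) Jacobi method: rotate columns
    i and j of A (and of V) so that the new columns i and j of A become
    orthogonal. Sketch only; notation follows equations (6)-(9)."""
    gamma = A[:, i] @ A[:, j]                     # dot product of the columns (eq. 7)
    if abs(gamma) < eps:                          # columns already orthogonal
        return
    alpha = A[:, i] @ A[:, i]                     # squared column norms (eq. 7)
    beta = A[:, j] @ A[:, j]
    zeta = (beta - alpha) / (2.0 * gamma)         # eq. (8)
    sgn = 1.0 if zeta >= 0 else -1.0
    t = sgn / (abs(zeta) + np.hypot(1.0, zeta))   # tan(theta), so |theta| <= pi/4
    c = 1.0 / np.hypot(1.0, t)                    # cos(theta), eq. (9)
    s = c * t                                     # sin(theta)
    Ai, Aj = A[:, i].copy(), A[:, j].copy()       # column update, eq. (6)
    A[:, i], A[:, j] = c * Ai - s * Aj, s * Ai + c * Aj
    Vi, Vj = V[:, i].copy(), V[:, j].copy()       # same rotation applied to V, eq. (11)
    V[:, i], V[:, j] = c * Vi - s * Vj, s * Vi + c * Vj
```

After enough sweeps, the singular values are simply the norms of the columns of $A$ and $U$ is obtained by normalising those columns, as in equation (2).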

2.2 Parallel implementation

Choosing the pairs $(i, j)$ on serial machines is usually done according to a fixed cycle. A simple sweep consists of the cyclic-by-rows ordering:

$(1,2), (1,3), \dots, (1,n), (2,3), \dots, (2,n), (3,4), \dots, (n-1,n)$

Because of the convergence of the cyclic-by-rows Jacobi method with condition (10)[8, 5], the Hestenes method always converges[16, 5]. The inherent parallelism in a sweep is characterised by the fact that the operations on the column pair $(A^{(k)}_{(:,i)}, A^{(k)}_{(:,j)})$ shown in (6) to (9) affect only those two columns. It is therefore possible to carry out the column-pair orthogonalisations of a sweep in parallel. In [14] Hansen discusses the convergence associated with various cyclic orderings of a sweep and defines some preference factors for comparing different orderings. Those orderings in which rotations happen on elements which have not recently been coupled are preferred. Subject to this, Brent and Luk[5] use $n/2$ ($n$ even) processors to orthogonalise the column pairs of an $n$-column matrix in parallel. Each processor $P_k$, $1 \le k \le n/2$, is assigned the two adjacent columns $(2k-1, 2k)$ in two variables $U_k$ and $L_k$. After orthogonalisation a synchronisation happens to permit all processors to exchange their left-most and right-most columns. Processor $P_k$ ($k \ne 1$ and $k \ne n/2$) sends the variables $L_k$ and $U_k$ respectively to processors $P_{k-1}$ and $P_{k+1}$ and receives variables $U_{k-1}$ and $L_{k+1}$ respectively from $P_{k-1}$

and $P_{k+1}$. Processor $P_1$ permanently maintains variable $U_1$, sends variable $L_1$ to processor $P_2$ and receives from it variable $L_2$. Processor $P_{n/2}$ receives variable $U_{n/2-1}$ from processor $P_{n/2-1}$ and sends it variable $U_{n/2}$. The convergence proof of [8] does not apply to the Brent and Luk parallel solution. Meanwhile, one can use the threshold approach of [24] to enforce convergence. The number of steps needed


to complete a sweep in parallel on $n/2$ processors is reduced to $n - 1$ (fig. 1(a)). For very large matrices, it is not possible

Figure 1. Column orthogonalisation on multiprocessor computers: (a) pairwise distribution, (b) block distribution

to have a large number of processors. Virtual processors can be used conceptually[4] to fill the gap between the physical processors and the number of processors needed. Some message passing programs provide the possibility of declaring as many processors as needed. Using virtual processors and leaving their management to the message passing program would, however, unnecessarily increase the communication cost. It was therefore preferable for us to divide the matrix into blocks and assign each block to a physical processor (fig. 1(b)). Each processor then orthogonalises all column pairs in the block assigned to it and exchanges its left-most and right-most columns with neighbouring processors. This method has been implemented on a cluster of workstations. In the next sections we explain how an efficient use of reconfigurable devices can increase the overall performance.
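The block distribution can be illustrated with the following sequential sketch; it reuses orthogonalise_pair from the previous sketch and deliberately simplifies the boundary-column exchange, so it only illustrates the data movement, not the exact Brent and Luk ordering or the MPI code of the real system.

```python
import numpy as np

def block_sweep(A, V, blocks):
    """One 'super-step' of the block distribution (fig. 1(b)): every block
    orthogonalises its own column pairs, then neighbouring blocks trade
    their boundary columns (simplified exchange)."""
    for blk in blocks:                         # executed in parallel on the real system
        for a in range(len(blk)):
            for b in range(a + 1, len(blk)):
                orthogonalise_pair(A, V, blk[a], blk[b])   # kernel from section 2.1
    for k in range(len(blocks) - 1):           # exchange left-most / right-most columns
        blocks[k][-1], blocks[k + 1][0] = blocks[k + 1][0], blocks[k][-1]

# Example set-up: n columns split into p blocks of column indices.
n, p = 12, 4
A = np.random.default_rng(1).standard_normal((20, n))
V = np.eye(n)
blocks = [list(b) for b in np.array_split(np.arange(n), p)]
for _ in range(5):          # a few super-steps just to exercise the exchange;
    block_sweep(A, V, blocks)   # the real system follows the Brent/Luk ordering
```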

3 The Distributed Reconfigurable System

The implementation of the SVD on a DRS was motivated by the following observations:

- adding processing elements on a node can help decrease the communication cost of a parallel system, in particular if message passing is not efficient;

- the time used for floating point operations can be considerably reduced by using a more efficient hardware implementation;

- combining many instructions into just one macro-instruction in the FPGA, to compute a large amount of data in a burst fashion, can drastically increase the performance of a system;


- the possibility of changing the circuitry configuration of FPGAs at run time in order to accommodate a new computation can help accelerate the computation of different modules of an application using only one device.

When a dataflow oriented function is expected to be too slow on a Central Processing Unit (CPU), it is time to think of a hardware implementation, provided the function is inherently parallel. This is usually done in an ASIC, in which the function is hardwired once and cannot be changed again. FPGAs are programmable devices with the possibility of changing their circuitry configuration at run time in order to implement new hardware functions. Further on we call such a device a Reconfigurable Processing Unit (RPU). Combining the dataflow aspect and the inherent parallelism which characterise some classes of functions, FPGAs can be used to implement those functions more efficiently than CPUs and more flexibly than ASICs[1, 11, 12]. The possibility of reconfiguring those devices at run time has made them an attractive solution. FPGAs have successfully been used to provide fast computation in many application areas, including text and image processing ([18, 13, 9, 19]) and floating point computation[22].

The model targeted in this paper is called a Distributed Reconfigurable System. It is a cluster of workstations with boards[17] equipped with some FPGAs connected to some nodes (figure 2). A node in such a cluster is either a set of CPUs or a set of CPUs connected to some RPUs. The CPUs and RPUs communicate through local connections at the node level. Nodes are connected together via a Local Area Network, in our case through an SCI (Scalable Coherent Interface) network. The set of CPUs on a node can be, for example, an SMP (Symmetric Multiprocessor) machine, and the set of RPUs on a node can be a set of FPGAs on a board connected to the PCI bus.

Such an architecture provides us with three levels of parallelism. The first level is the system level. It consists of all nodes in the cluster. At this level, a message passing interface can be used to implement algorithms in parallel. The second level is the node level, where the RPUs and the CPUs are locally connected and share resources such as memory. The third level is the RPU level. At this level, fine grained functions can be implemented to run in parallel in the reconfigurable hardware. IP (Intellectual Property) cores, which we simply call cores, are developed and provided via the internet and other channels by companies specialised in function implementation for RPUs. They are designed by teams of experts who have a deep understanding of the FPGA structure and do their best to provide the user with efficient blocks. For the purpose of reconfiguration, we consider that a number of cores which can be parameterised at run time and downloaded into RPUs are available in a database.

Figure 2. A Distributed Reconfigurable System

4 Block Orthogonalisation on one node

In this section we consider the repartition of the problem to be solved on a node of the DRS. Each step of Jacobi-like methods for the SVD is dominated by the pairwise orthogonalisation of a pair of columns. This requires the computation of the sine and cosine of the rotation angle, updating the column elements and testing for convergence of the column pair with the threshold method of Rutishauser[20] and the parameters as defined in [23]. Computations are not carried out only on the first level (system level) of parallelism of our system, but also on the second level (the node level). At this level, the speed and the granularity of the functions to be handled on the CPU and RPU determine the number of column pairs to be assigned to each PU (processing unit). If a block of size $n$ (number of columns) is allocated to a node, then the partitioning of the block happens as described in [7]: let $t_{cpu}$ be the time needed to execute a macro instruction on the CPU and $t_{rpu}$ the time needed to execute the same macro instruction on the RPU. As stated in [7], the following equation should hold:

$t_{cpu} \cdot n_{cpu} \approx t_{rpu} \cdot n_{rpu}, \quad$ with $\; n_{cpu} + n_{rpu} = n$   (12)

Having distributed the columns to be handled among the PUs, an indication must be given to the PUs on where to find the columns to orthogonalise. The PUs can then operate in parallel to complete their tasks. Because of the difficulty for the RPUs in the current system to work in a stand-alone way, their control is given to a process. Such a process requires CPU time only to reconfigure the RPU, to start and stop the transfer of data needed by the RPU and to read the computation results. With this, the amount of CPU time needed by this process remains low. The CPUs and the RPUs can then work in parallel.
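A minimal sketch of the column split implied by equation (12) is given below; the function name split_columns is ours, and the timing values in the example call simply reuse the single-MAC latencies from Table 1 as stand-ins for the measured macro-instruction times.

```python
def split_columns(n, t_cpu, t_rpu):
    """Split n columns between CPU and RPU so that, per equation (12),
    t_cpu * n_cpu is approximately equal to t_rpu * n_rpu."""
    n_cpu = round(n * t_rpu / (t_cpu + t_rpu))   # the slower PU gets fewer columns
    return n_cpu, n - n_cpu

# Example: single-MAC times from Table 1 used as stand-ins for the macro times.
print(split_columns(1000, t_cpu=365e-9, t_rpu=46e-9))   # -> (112, 888)
```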

4.1 Reconfiguration

For the purpose of two-column orthogonalisation, three different circuitries for the needed macro functions are implemented and stored in bitstream files.


Those macros calculate the dot products of two column vectors as defined in equation (7) for the generation of the cosine and sine of the rotation angle, the column element updating and the convergence test. For this experiment, the FPGA model 4062XL of Xilinx has been used. Because of the impossibility of partly reconfiguring the 4062XL FPGAs, we chose the reconfiguration to be done by completely replacing the current configuration of an RPU with a new one.

Figure 3. Flow chart for the Hestenes method on a DRS node (RPU steps: reconfiguration for cos θ and sin θ calculation, cos θ and sin θ calculation, column element updating, and orthogonality test)

The algorithm on one node is shown in fig. 3. After balancing the columns between the CPU and the RPU as stated in equation (12), a PU is reconfigured if necessary for the next calculation (steps 1, 3 and 5). This step is done by the CPU, which downloads the core for the next macro to the RPU. After reconfiguration, calculations happen on both the CPU and the RPU in parallel (steps 2, 4 and 6). Repeating steps 1 to 6 for each column pair would lead to a very big reconfiguration overhead, since about 2 seconds have to be added to the time needed to orthogonalise two columns; the reconfiguration time would be greater than the computation time. Fortunately, a good restructuring of the algorithm helps to reduce the reconfiguration overhead. Instead of reconfiguring the RPUs just for a computation on two columns, we reconfigure the device only once for all the column pairs allocated to the RPU. Steps 3 to 6 can also be carried out in just one reconfiguration and one computation, where the updating and convergence-test cores are merged to form one core for the two operations. We can even eliminate the reconfiguration overhead by merging all the cores needed during the computation into just one core with different activation signals. When a core is needed, its signal is simply activated in the RPUs. With this, the total reconfiguration time is close to zero.
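The restructured control flow can be sketched as follows; the rpu object and its methods (configure, stream_columns, run, read_result) are hypothetical names used only to illustrate the idea of reconfiguring once per macro and then streaming all assigned column pairs.

```python
def process_block_on_rpu(rpu, column_pairs):
    """Host-side control sketch: load each macro core once, then stream all
    column pairs assigned to the RPU through it (hypothetical API)."""
    for core in ("dot_products", "update_and_convergence_test"):   # merged cores
        rpu.configure(core)                # bitstream download (~2 s), done once per core
        for (i, j) in column_pairs:
            rpu.stream_columns(i, j)       # DMA the two columns to the board
            rpu.run()                      # burst execution of the macro instruction
            rpu.read_result()              # collect results on the host
```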

4.2 Performance

The performance of such a distributed algorithm with hardware/software cooperation depends on different criteria. With respect to the distribution, two main levels can be distinguished: the system level and the node level. At the system level the communication between nodes is done via an SCI network. We use a message passing interface (MPI) to implement the communication. Because we have no influence on the MPI, we decided to deal with the problem at the node level, after the partitioning of the matrix across all the nodes of our system. In the next section we describe the performance measurement only at the node level, with the assumption that the communication managed by the MPI will not affect the speedup of the algorithm on the DRS.

4.2.1 Node level

Block orthogonalisation on a node is dominated by the calculation of the rotation angle for column pair orthogonalisation. It is based on the dot product computation of two vectors representing the involved columns. We focus on this because it is the most time consuming operation. A Multiply-Accumulate (MAC) macro instruction (fig. 4) on 32-bit floating point numbers in the IEEE standard 754-1985[15] is used for the dot product computations.

Figure 4. MAC module
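As a software model of what the streamed MAC core computes (our sketch, not the hardware description), the dot product of two columns is simply a chain of multiply-accumulate operations over the streamed element pairs:

```python
def mac_dot_product(xs, ys):
    """Software model of the MAC macro instruction: multiply-accumulate over
    two streamed columns, as the FPGA core does on 32-bit IEEE-754 floats."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y          # one MAC per element pair
    return acc
```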

The core for this MAC macro-instruction has a latency of 46 ns on the Xilinx 4062XL FPGA. That means we can theoretically reach a data rate of 165.9 MB/s, which is far greater than that of a Pentium III processor running at 450 MHz (see Table 1). If we consider a matrix of size $10^6 \times 10^6$ distributed across 10 nodes, each node will be dealing with a $10^6 \times 10^5$ block matrix. In practice, term by document matrices are quite sparse; they contain only about 0.002% non-zero entries[2]. The total number of MAC operations required to complete a sweep is about $10^{11}$, which can theoretically be done in 4600 s. The total number of sweeps needed to complete the SVD of an $m \times n$ matrix is conjectured by Brent and Luk[5, 4] to be $\log(n)$. With this, the time needed to complete the SVD of a $10^6 \times 10^6$ matrix on a DRS with 10 nodes will be in the range of 17 hours. This performance is unlikely to be reached by that of [2] even if their amount of resources is multiplied by 10. The reachable data rate of our system is about 90 MB/s for Direct Memory Access (DMA) transfers. This will affect the computation time of the SVD on our system, but we will still be far better than that of [2].


              Software           MAC-Core          FPGA access (DMA)
Target        Pentium III-450    Xilinx XC-4000    Raptor board
Single MAC    365 ns             46 ns             85 ns
MAC/s         2.14 M             20.7 M            11.8 M
Data rate     20.9 MB/s          165.9 MB/s        90.0 MB/s

Table 1. Performance
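A back-of-the-envelope check of the estimate above, assuming the per-sweep MAC count of about $10^{11}$ implied by the 4600 s figure and reading $\log(n)$ as the natural logarithm:

```python
import math

mac_latency = 46e-9        # seconds per MAC on the FPGA core (Table 1)
macs_per_sweep = 1e11      # MAC operations per sweep per node (see text: 4600 s at 46 ns)
sweeps = math.log(1e6)     # conjectured number of sweeps for n = 10^6 (about 13.8)

hours = macs_per_sweep * mac_latency * sweeps / 3600.0
print(f"{hours:.1f} hours")   # about 17.6 hours, i.e. "in the range of 17 hours"
```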



The algorithm described in this paper is currently implemented on the system described in section 3 with 8 nodes. Each node is equipped with two boards; each board carries either one Xilinx Virtex FPGA or two Xilinx 4000XL FPGAs. The system is currently running as a pure software solution, but we are working to integrate the FPGAs into the computation as described in the previous sections.

5 Conclusion

In this paper we have dealt with the implementation of the singular value decomposition of big matrices on a distributed reconfigurable system. We have shown that RPUs and CPUs can work together in a distributed environment to obtain a better solution for the SVD. We explained how the reconfiguration overhead can be eliminated through an efficient implementation of cores and a reorganisation of the algorithm. Using DMA transfers coupled with an efficient implementation leads to a performance that justifies the employment of FPGAs for matrix computations. We keep on developing the system and trying to better understand our environment, because we will surely gain additional speedup from a more efficient use of our architecture. The work we have done will be integrated in a complete search engine that has recently been implemented.

References

[1] A. M. S. Adario and S. Bampi. Reconfigurable computing: Viable applications and trends. In IFIP TC10 WG10.5 10th Int. Conf. on Very Large Scale Integration (VLSI'99), pages 583-594, Lisboa, Portugal, 1999. IFIP.

[2] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. Using linear algebra for information retrieval. J. Soc. Indust. Appl. Math., 37(4):573-595, 1995.

[3] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACK (Version 1.0) User's Guide, 1996.

[4] R. P. Brent. Parallel algorithms in linear algebra. In Proc. Second NEC Research Symposium, Tsukuba, Japan, August 1991.

[5] R. P. Brent and F. T. Luk. The solution of singular-value and eigenvalue problems on multiprocessor arrays. SIAM J. Sci. Stat. Comput., 6(1):69-84, 1985.

[6] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[7] G. Estrin and C. R. Viswanathan. Organization of a "fixed-plus-variable" structure computer for computation of eigenvalues and eigenvectors of real symmetric matrices. Journal of the ACM, 9:41-60, 1962.

[8] G. E. Forsythe and P. Henrici. The cyclic Jacobi method for computing the principal values of a complex matrix. Trans. Amer. Math. Soc., 94:1-23, 1960.

[9] P. Foulk. Data-folding in SRAM configurable FPGA. In IEEE Workshop on FPGAs for Custom Computing Machines, pages 163-171. IEEE, 1993.

[10] G. H. Golub and C. F. Van Loan. Matrix Computations. North Oxford Academic Publishing, 1983.

[11] S. Guccione. Programming Fine-Grained Reconfigurable Architectures. PhD thesis, The University of Texas at Austin, May 1995.

[12] S. Guccione and D. Levi. The advantages of run-time reconfiguration. In Proc. SPIE 3844, FPGAs for Computing and Applications, pages 87-92. J. Schewel et al., Eds., Sept. 1999.

[13] B. Gunther and G. Milne. Accessing document relevance with run-time reconfigurable machines. In IEEE Workshop on FPGAs for Custom Computing Machines, pages 9-16, Napa, California, 1996. IEEE.

[14] E. R. Hansen. On cyclic Jacobi methods. J. Soc. Indust. Appl. Math., 11(2):448-459, 1963.

[15] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 1990.

[16] M. R. Hestenes. Inversion of matrices by biorthogonalization and related results. J. Soc. Indust. Appl. Math., 6(1):51-90, 1958.

[17] H. Kalte, M. Porrmann, and U. Rueckert. Rapid Prototyping System für dynamisch rekonfigurierbare Hardwarestrukturen. In AES 2000, pages 149-157, 2000.

[18] S.-M. Ludwig. Hades: Fast Hardware Synthesis Tools and a Reconfigurable Coprocessor. PhD thesis, Swiss Federal Institute of Technology, Zurich, 1997.

[19] L. Minzer. Programmable silicon for embedded signal processing. Embedded Systems Programming, pages 110-133, March 2000.

[20] H. Rutishauser. The Jacobi method for real symmetric matrices. Handbook for Automatic Computation, Vol. 2 (Linear Algebra):202-211, 1971.

[21] G. Salton. The SMART Retrieval System. Prentice Hall, Inc., 1971.

[22] N. Shirazi, A. Walters, and P. Athanas. Quantitative analysis of floating point arithmetic on FPGA based custom computing machines. In IEEE Symposium on FPGAs for Custom Computing Machines, Napa Valley, California, 1995. IEEE.

[23] G. R. Gao and S. J. Thomas. An optimal parallel Jacobi-like solution method for the singular value decomposition. In Proc. Int. Conf. on Parallel Processing, January 1988.

[24] J. H. Wilkinson. The algebraic eigenvalue problem. Oxford University Press, 1965.
