Solution of Large, Dense Symmetric Generalized Eigenvalue Problems Using Secondary Storage

ROGER G. GRIMES and HORST D. SIMON Boeing Computer Services

This paper describes a new implementation of algorithms for solving large, dense symmetric eigenproblems AX = BXΛ, where the matrices A and B are too large to fit in the central memory of the computer. Here A is assumed to be symmetric, and B symmetric positive definite. A combination of block Cholesky and block Householder transformations is used to reduce the problem to a symmetric banded eigenproblem whose eigenvalues can be computed in central memory. Inverse iteration is applied to the banded matrix to compute selected eigenvectors, which are then transformed back to eigenvectors of the original problem. This method is especially suitable for the solution of large eigenproblems arising in quantum physics, using a vector supercomputer with a fast secondary storage device such as the Cray X-MP with SSD. Some numerical results demonstrate the efficiency of the new implementation.

Categories and Subject Descriptors: F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems-computation on matrices; G.1.3 [Numerical Analysis]: Numerical Linear Algebra-eigenvalues, sparse and very large systems

General Terms: Algorithms

Additional Key Words and Phrases: Block Householder reduction, eigenvalue problems, out-of-core algorithms, supercomputers, vector computers

1. INTRODUCTION

Quantum mechanical bandstructure computations require repeated computation of a large number of eigenvalues of a symmetric generalized eigenproblem [9]. After the eigenvalues are examined, a selected number of eigenvectors are required in order to continue the computations. This application is memory-limited on most of today's supercomputers, with the exception of the Cray-2. For example, on a 4-Mword Cray X-MP/24, eigenvalue problems of order up to 1900 can be solved with a modification of standard in-core software such as EISPACK. Several applications require the solution of problems of about twice that size.

The authors gratefully acknowledge support through NSF grant ASC-8519354. Authors' current addresses: R. G. Grimes, Boeing Computer Services, Engineering Scientific Services Division, P.O. Box 24346, M/S 7L-21, Seattle, WA 98124-0346; H. D. Simon, NAS Systems Division, NASA Ames Research Center, Moffett Field, CA 94035. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1988 ACM 0098-3500/88/0900-0241 $01.50

ACM Transactions on Mathematical Software, Vol. 14, No. 3, September 1988, Pages 241-256.


This need for an efficient out-of-core solution algorithm for generalized eigenvalue problems motivated our research.

The efficient in-core solution of the symmetric generalized eigenproblem

AX = BXΛ, (1)

where A and B fit into memory, has been addressed in [9]. The authors extended EISPACK [7, 16] to include a path for the generalized eigenproblem where A and B are in packed symmetric storage. They also applied a block-shifted Lanczos algorithm to the same problem.

This paper describes an algorithm developed to solve (1) where A and B are too large to fit in central memory. The key computational element that requires an efficient implementation in eigenvalue calculations for dense matrices is the Householder reduction to tridiagonal form. Even though the in-core implementation of a Householder reduction is well understood, an out-of-core Householder reduction is far from trivial. The main problem is that in each of the n steps of the Householder reduction an access to the unreduced remainder of the matrix appears to be necessary. This seems to indicate that on the order of n^3 I/O transfers are necessary. The main contribution of this paper is the application of a block algorithm that reduces the I/O requirements to O(n^3/p), where p is the block size. With an appropriately chosen block size, the I/O requirements can be reduced to a level where they are completely dominated by the execution time.

The approach is to first reduce (1) to standard form by applying the Cholesky factor L of B as follows:

(L^{-1} A L^{-T})(L^T X) = (L^T X) Λ

or

CY = YΛ, (2)

where

C = L^{-1} A L^{-T},

Y = L^T X. (3)

The matrix B is factored using the block Cholesky algorithm, where B is initially stored in secondary storage and is overwritten with its factor L [8, 17]. Since B is assumed to be symmetric positive definite, the Cholesky factorization exists, and no pivoting is necessary.
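For concreteness, the following small in-core sketch (in Python with NumPy/SciPy, not the FORTRAN of the actual package) illustrates the algebra of (2) and (3): B is factored, C = L^{-1} A L^{-T} is formed with two triangular solves, and the eigenvectors are mapped back through X = L^{-T} Y. The helper name and the random test problem are ours, and the out-of-core blocking described below is deliberately omitted.

# In-core sketch of the reduction to standard form, equations (2)-(3).
import numpy as np
from scipy.linalg import cholesky, solve_triangular, eigh

def reduce_and_solve(A, B):
    L = cholesky(B, lower=True)                   # B = L L^T
    C = solve_triangular(L, A, lower=True)        # L^{-1} A
    C = solve_triangular(L, C.T, lower=True).T    # (L^{-1} (L^{-1} A)^T)^T = L^{-1} A L^{-T}
    lam, Y = eigh(C)                              # C Y = Y Lambda
    X = solve_triangular(L, Y, trans='T', lower=True)   # X = L^{-T} Y
    return lam, X

# Small random test: A X = B X Lambda should hold.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)
lam, X = reduce_and_solve(A, B)
print(np.max(np.abs(A @ X - (B @ X) * lam)))      # small residual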

The next step in the standard EISPACK approach would be to reduce C to tridiagonal form using a sequence of Householder transformations. The application of Householder transformations to reduce C to tridiagonal form requires access to all of the unreduced portion of C for the reduction of each column. This access requirement would incur O(n^3) I/O transfers. This amount of I/O is too high for an out-of-core implementation, since the efficiency of the algorithm would effectively be limited by the transfer speed to and from secondary storage.

Instead, a block Householder transformation is used to reduce C to a band matrix, where the band matrix will have small enough bandwidth to be held in central memory. We partition the matrix C symmetrically into an m-by-m block matrix. Each block has p rows and columns, except for the blocks in the last block row (column), which may have fewer than p rows (columns). Only the block lower triangular part of the coefficient matrix is stored on the secondary storage device. For each block column we compute the QR factorization of the matrix comprised of the blocks below the diagonal block. The product of Householder transformations used to compute the QR factorization is accumulated in the WY representation described in [2]. This representation is then applied simultaneously to the left and right of the remainder of the matrix. When all block columns except the last are reduced in this fashion, the result is a band matrix with bandwidth p which is similar to the matrix C in (2). The I/O requirements of the block Householder transformation are O(n^3/p). The performance of the algorithm will no longer be I/O-bound for even small block sizes (say p > 10). Obviously, the precise value of p for which the processing time begins to dominate will depend on the ratio of processor speed and I/O speed of a particular machine.

The banded eigenproblem is then transformed to tridiagonal form using an enhanced version of EISPACK subroutine BANDR. Finally, EISPACK subroutine TQLRAT is used to compute the eigenvalues.

Specified eigenvectors are computed by first computing the associated eigenvectors of the band matrix and then back-transforming them to Y using the accumulated block Householder transformations. Application of L^{-T} to Y back-transforms Y to X, the eigenvectors of the original problem (1). The eigenvectors of the band matrix are computed using inverse iteration implemented in a much modified version of EISPACK subroutine BANDV.

This algorithm has been implemented in a pair of subroutines HSSXGV and HSSXG1 which, respectively, compute the eigenvalues and a specified set of eigenvectors of (1). The usage of these two subroutines, as well as of subroutines HSSXEV and HSSXE1, which compute the eigenvalues and specified eigenvectors of (2), is described in [10].

Section 2 describes in further detail the reduction of the generalized problem to standard form (3). Section 3 discusses the implementation of the block Householder reduction to banded form. Section 4 describes the modifications that were applied to several EISPACK subroutines for band matrices; these modifications resulted in the increased efficiency of several routines. The computation of the eigenvectors is discussed in Section 5. Some resource estimates, performance results, and comparisons with in-core algorithms are presented in the final sections.

Our target machine is the Cray X-MP with a Solid-state Storage Device (SSD). Current Cray configurations range from 64 Mword SSDs up to 1024 Mword SSDs. A 1024 Mword SSD can hold a matrix of order 32000, or, in our application, two symmetric packed matrices of the same order. The SSD has an asymptotic maximum transfer rate of about 150 million words per second. This means that the whole contents of the SSD could be read in a few seconds (if there were enough memory). Another way of appreciating the fast rate of the SSD is by comparing it to the peak performance of the single-processor Cray X-MP, which is about 210 million floating point operations per second (MFLOPS). Thus the overall speed of the machine is very well in balance with fast secondary memory access times. By comparison, the fastest disks available on a Cray X-MP are an order of magnitude slower. A DD-49 disk can hold 150 Mwords and has an asymptotic transfer rate of 12.5 Mwords per second. In spite of the focus on this particular target machine, we have attempted to develop a portable implementation. Indeed, our software has been successfully tested on a variety of machines such as a Sun 3/260 workstation, a Cyber 875, an SCS-40, and a µ-VAX.

2. REDUCTION TO STANDARD FORM

The first step in the reduction to standard form is the Cholesky factorization of B. The out-of-core block factorization of a full matrix is a standard computational task, and efficient software is available. Here we use a row-oriented implementation which computes the lower triangular matrix L [8, 17]. We do not discuss this algorithm here in more detail, except for two facts. The implementation [17] is written in a way that makes extensive use of matrix-matrix multiplication as a basic computational kernel (see [6, 13]). By providing a specially tailored version of this kernel, very high computational speeds can be obtained on vector computers. For example, on a Cray X-MP the implementation in [17] performs at 192 Mflops for a system of order 3000 using the SSD as a secondary storage device. The second fact which should be mentioned is that the partitioning of B into blocks can be independent from the partitioning of A; in other words, the blocking of B for the Cholesky factorization is not required to conform to the blocking of A for the block Householder reduction.
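The following is a minimal in-core sketch of a blocked (right-looking) Cholesky factorization in which the bulk of the arithmetic is cast as matrix-matrix products, in the spirit of the kernel-based implementation cited above; it is not the row-oriented out-of-core code of [17], and the function name and block size are ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(B, p=64):
    # Right-looking blocked Cholesky, B = L L^T (in-core sketch).
    # Most of the flops are in the rank-p trailing update, a matrix-matrix product.
    n = B.shape[0]
    L = np.tril(B).copy()
    for k in range(0, n, p):
        e = min(k + p, n)
        L[k:e, k:e] = cholesky(L[k:e, k:e], lower=True)        # factor diagonal block
        if e < n:
            # panel solve: L21 = B21 L11^{-T}
            L[e:, k:e] = solve_triangular(L[k:e, k:e], L[e:, k:e].T, lower=True).T
            # trailing update: B22 <- B22 - L21 L21^T (full square updated for simplicity)
            L[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T
    return np.tril(L)

rng = np.random.default_rng(1)
n = 300
M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)
L = blocked_cholesky(B, p=64)
print(np.max(np.abs(L @ L.T - B)))    # small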

Assuming that there is existing software for computing the Cholesky factor L of B using secondary storage and for performing forward elimination and back substitution with L, we partition the matrix A symmetrically into an m-by-m block matrix. Each block has p rows and columns, except for the blocks in the last block row (column) of A, which may have fewer than p rows (columns); that is,

A = | A_{1,1}   A_{2,1}^T  ...  A_{m,1}^T |
    | A_{2,1}   A_{2,2}    ...  A_{m,2}^T |
    |   ...        ...     ...     ...    |        (4)
    | A_{m,1}   A_{m,2}    ...  A_{m,m}   |

The blocks of the lower triangle of A are assumed to be stored in secondary storage.

The reduction of the eigenproblem to standard form given by (2) is accomplished as follows. First the lower block triangle of X = L^{-1}A is computed, one block column at a time, starting with the last block column. The whole column is computed, but only the lower block triangle is stored. With X = L^{-1}A, the matrix C = X L^{-T} is computed next by reading in one block column of X^T at a time and performing forward elimination in order to compute C^T = L^{-1}X^T. Because of symmetry, it is only necessary to carry out the forward elimination of the kth block column up to block k. The result is then stored, and the lower block triangle of C is obtained as a result.

Even though the intermediate result X is not symmetric, this procedure exploits the symmetry of the problem. Both the number of operations and the (out-of-core) storage requirements are reduced, since the computation of the full C is avoided. The detailed algorithm for this reduction process is given below.

Algorithm 2.1. Reduction to Standard Form

Compute the Cholesky factor L of B.
for k = m, 1, -1
    for j = 1, k - 1
        Read in block A_{k,j} and set X_j = A_{k,j}^T.
    end for
    for j = k, m
        Read in block A_{j,k} and set X_j = A_{j,k}.
    end for
    Compute X = L^{-1} X.
    for j = k, m
        Write X_j over A_{j,k}.
    end for
end for
"The lower triangle of A has been overwritten with L^{-1}A. Now form C = X L^{-T} with C overwriting X."
for k = 1, m
    for j = 1, k
        Read in block X_{k,j} and store in Y_j.
    end for
    Compute Y^T = L^{-1} Y^T, performing the forward elimination only to block k of Y.
    for j = 1, k
        Write block Y_j over X_{k,j}.
    end for
end for
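An in-core sketch of Algorithm 2.1 follows (Python/NumPy, with the reads and writes of secondary storage replaced by array slicing); the block bookkeeping and helper name are ours, and the result is checked against a dense computation of C.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def reduce_to_standard_blocked(A, B, p):
    # Form the lower triangle of C = L^{-1} A L^{-T} block column by block column,
    # in the spirit of Algorithm 2.1 (in-core sketch).
    n = A.shape[0]
    L = cholesky(B, lower=True)
    blocks = [(s, min(s + p, n)) for s in range(0, n, p)]
    m = len(blocks)
    X = np.zeros_like(A)
    # Pass 1: X = L^{-1} A, one block column at a time, keeping only the lower block triangle.
    for k in range(m - 1, -1, -1):
        ck0, ck1 = blocks[k]
        col = np.empty((n, ck1 - ck0))
        for j in range(k):                               # blocks above the diagonal: use A_{k,j}^T
            rj0, rj1 = blocks[j]
            col[rj0:rj1, :] = A[ck0:ck1, rj0:rj1].T
        col[ck0:, :] = A[ck0:, ck0:ck1]                  # blocks on and below the diagonal
        col = solve_triangular(L, col, lower=True)       # forward elimination
        X[ck0:, ck0:ck1] = col[ck0:, :]                  # keep the lower block triangle
    # Pass 2: C = X L^{-T}, i.e. C^T = L^{-1} X^T, one block row of X at a time;
    # forward elimination is only needed up to block k.
    C = np.zeros_like(A)
    for k in range(m):
        rk0, rk1 = blocks[k]
        row = X[rk0:rk1, :rk1]
        C[rk0:rk1, :rk1] = solve_triangular(L[:rk1, :rk1], row.T, lower=True).T
    return C, L

rng = np.random.default_rng(2)
n, p = 200, 32
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)
C, L = reduce_to_standard_blocked(A, B, p)
Cref = solve_triangular(L, solve_triangular(L, A, lower=True).T, lower=True).T
print(np.max(np.abs(np.tril(C - Cref))))    # lower triangles agree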

3. BLOCK HOUSEHOLDER REDUCTION TO BANDED FORM

The symmetric generalized eigenproblem of (1) has now been reduced to a symmetric standard eigenproblem of (2). The next step of the algorithm is to reduce (2) to banded form by a sequence of block Householder transformations. Each individual block Householder transformation is implemented using Bischof and Van Loan's WY representation [2] of a product of Householder transformations. This form has proven to be useful in implementing the QR factorization in a parallel processing environment and in obtaining optimal performance in FORTRAN implementations of the QR factorization on vector computers [1, 12].

The product of p Householder transformations can be represented as

H_1 H_2 ... H_p = I - W Y^T,

where W and Y are matrices with p columns [2]. Let the matrix C be symmetrically partitioned as

C = | C_{1,1}   C_{2,1}^T  ...  C_{m,1}^T |
    | C_{2,1}   C_{2,2}    ...  C_{m,2}^T |
    |   ...        ...     ...     ...    |
    | C_{m,1}   C_{m,2}    ...  C_{m,m}   |

and let M be the first block column of C excluding the diagonal block,

M = | C_{2,1} |
    | C_{3,1} |
    |   ...   |
    | C_{m,1} |.

Let H_1, H_2, ..., H_p be the Householder transformations from the QR factorization of M, where each H_i has the form

H_i = I - 2 u_i u_i^T.

The Householder transformations are accumulated into the WY representation with the columns w_i and y_i of W and Y defined as follows:

w_i = 2 u_i,

y_i = H_p H_{p-1} ... H_{i+1} u_i.

It is easily demonstrated that

H_1 H_2 ... H_p = I - W Y^T.
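A small numerical check of this accumulation (with w_i = 2u_i and y_i = H_p H_{p-1} ... H_{i+1} u_i) can be written as follows; the dimensions and the random Householder vectors are ours.

import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 4

# p unit-length Householder vectors, H_i = I - 2 u_i u_i^T
U = rng.standard_normal((n, p))
U /= np.linalg.norm(U, axis=0)
H = [np.eye(n) - 2.0 * np.outer(U[:, i], U[:, i]) for i in range(p)]

# Accumulate the WY form: w_i = 2 u_i, y_i = H_p H_{p-1} ... H_{i+1} u_i
W = 2.0 * U
Y = np.empty_like(U)
for i in range(p):
    y = U[:, i].copy()
    for j in range(i + 1, p):        # apply H_{i+1}, then H_{i+2}, ..., then H_p
        y = H[j] @ y
    Y[:, i] = y

prod = np.eye(n)
for i in range(p):
    prod = prod @ H[i]               # H_1 H_2 ... H_p

print(np.max(np.abs(prod - (np.eye(n) - W @ Y.T))))   # ~1e-15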

Let Q be a similarity transformation defined as

Q = | I        0        |
    | 0   I - W Y^T     |.

It can easily be shown that Q is indeed a similarity transformation. Furthermore, when applied to C, Q reduces the first block column to banded form with half bandwidth p + 1:

Q^T C Q = | C_{1,1}   R^T       0         ...  0         |
          | R         Ĉ_{2,2}   Ĉ_{3,2}^T ...  Ĉ_{m,2}^T |
          | 0         Ĉ_{3,2}   Ĉ_{3,3}   ...  Ĉ_{m,3}^T |        (5)
          |   ...        ...       ...            ...    |
          | 0         Ĉ_{m,2}   Ĉ_{m,3}   ...  Ĉ_{m,m}   |,

where R is the upper triangular matrix from the QR factorization of M and Ĉ_{i,j} denotes the blocks of C that are modified from the previous step.

The standard trick in the single vector case of reducing the amount of computation for symmetrically applying a single Householder transformation can be applied in the block context as well. Let C^{(2)} be the matrix consisting of the second through mth block rows and columns of C. Then (5) reduces to forming

Ĉ^{(2)} = (I - WY^T)^T C^{(2)} (I - WY^T). (6)

Expanding the expression yields

Ĉ^{(2)} = C^{(2)} - Y W^T C^{(2)} - C^{(2)} W Y^T + Y W^T C^{(2)} W Y^T. (7)

Let S = C^{(2)} W and V = W^T C^{(2)} W. Substituting S and V into (7) yields

Ĉ^{(2)} = C^{(2)} - Y S^T - S Y^T + Y V Y^T


or

Ĉ^{(2)} = C^{(2)} - Y(S - ½YV)^T - (S - ½YV)Y^T. (8)

Let T = S - ½YV and substitute into (8) to obtain

Ĉ^{(2)} = C^{(2)} - Y T^T - T Y^T. (9)

Thus the cost of forming Ĉ^{(2)} is reduced to the cost of multiplications with C^{(2)}, W, and Y. Furthermore, the matrix C^{(2)} is accessed only 3 times per block row (2 fetches and 1 store). All these operations are matrix-matrix multiplications, which can be performed very efficiently on vector computers.
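The identity behind (6)-(9) uses only the symmetry of C^{(2)} (and hence of V), so it can be checked numerically with arbitrary W and Y; a short sketch follows, with all names and dimensions ours.

import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 5
C2 = rng.standard_normal((n, n)); C2 = (C2 + C2.T) / 2    # symmetric C^(2)
W = rng.standard_normal((n, p))
Y = rng.standard_normal((n, p))

lhs = (np.eye(n) - W @ Y.T).T @ C2 @ (np.eye(n) - W @ Y.T)   # equation (6)

S = C2 @ W                        # S = C^(2) W
V = W.T @ S                       # V = W^T C^(2) W (symmetric)
T = S - 0.5 * Y @ V               # T = S - (1/2) Y V
rhs = C2 - Y @ T.T - T @ Y.T      # equation (9)

print(np.max(np.abs(lhs - rhs)))  # ~1e-13: identical up to roundoff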

The same procedure will be applied repeatedly to the first block column of the remainder of the matrix. After m - 1 steps, the dense matrix has then been reduced to banded form. The details of the reduction process are given in the algorithm below. In the following description, C^{(k+1)} refers to block rows and columns k + 1 through m of C.

Algorithm 3.1. Reduction to Banded Form

for k = 1, m - 1
    for j = k + 1, m
        Read in C_{j,k} and store in M_j.
    end for
    Compute the QR factorization of M, accumulating the matrices W, Y, and R.
    Write R over block C_{k+1,k}.
    Write W and Y to secondary storage.
    Compute S = C^{(k+1)} W with S overwriting W.
    Compute V = W^T S with V overwriting R.
    Compute T = S - ½YV with T overwriting W.
    for j = k + 1, m
        for i = j, m
            Read in C_{i,j}.
            C_{i,j} = C_{i,j} - Y_i T_j^T - T_i Y_j^T.
            Write out C_{i,j}.
        end for
    end for
end for
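The following in-core sketch performs the same block reduction to banded form; for brevity it applies the explicit orthogonal factor produced by NumPy's QR at each step instead of accumulating the WY representation used in the paper, and it keeps the whole matrix in memory. It verifies both the banded structure and the preservation of the eigenvalues; the function name and test sizes are ours.

import numpy as np

def block_reduce_to_band(C, p):
    # Reduce a symmetric matrix to band form (half bandwidth p) by block
    # orthogonal similarity transformations, one block column at a time.
    C = C.copy()
    n = C.shape[0]
    for b0 in range(0, n - p, p):
        b1 = b0 + p
        M = C[b1:, b0:b1]                        # sub-diagonal part of the block column
        Q, R = np.linalg.qr(M, mode='complete')  # M = Q R
        C[b1:, b0:b1] = np.triu(Q.T @ M)         # the column becomes [R; 0]
        C[b0:b1, b1:] = C[b1:, b0:b1].T          # keep symmetry
        C[b1:, b1:] = Q.T @ C[b1:, b1:] @ Q      # two-sided update of the remainder
    return C

rng = np.random.default_rng(5)
n, p = 120, 8
C = rng.standard_normal((n, n)); C = (C + C.T) / 2
Cb = block_reduce_to_band(C, p)

i, j = np.indices(Cb.shape)
print(np.max(np.abs(Cb[np.abs(i - j) > p])))                  # 0: banded
print(np.max(np.abs(np.sort(np.linalg.eigvalsh(Cb))
                    - np.sort(np.linalg.eigvalsh(C)))))       # eigenvalues preserved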

4. COMPUTING THE EIGENVALUES OF THE BAND MATRIX

We have thus reduced the large out-of-core problem to a standard in-core problem: the computation of the eigenvalues of a symmetric band matrix. The standard software for this task is EISPACK [7, 16]. However, we found the EISPACK routines for this problem, BANDR and BANDV, to be comparatively inefficient, and therefore applied several modifications. These modifications are described here and in the next section.

We examined the available software for solving the banded eigenproblem: BANDR from EISPACK and VBANDR, the vectorized version of BANDR [11]. BANDR uses modified Givens rotations to annihilate the nonzero entries within the band to reduce the band matrix to tridiagonal form. With this method, the zeroing of an entry in the band creates an extra element on the subdiagonal immediately outside the band. This new entry must then be zeroed, which creates another new entry p rows further down the subdiagonal. Several rotations are carried out until the element is chased off the bottom of the matrix (see [15]). This algorithm takes on the average n/2p rotations to chase away one nonzero in the band. Overall, approximately n^2/2 rotations are computed and applied with vector length of p. The implementation of BANDR in EISPACK, even with the appropriate compiler directives, is very inefficient on the Cray X-MP.

Kaufman's VBANDR uses Givens rotations to annihilate the nonzero entries within the band. Furthermore, it postpones the annihilation of the entries outside the band so that several entries can be annihilated simultaneously. It then vectorizes across the entries to be annihilated, instead of on the rows or columns of the band matrix. On the average, VBANDR simultaneously annihilates n/2p entries. VBANDR is more efficient than BANDR when p << n. Kaufman's original results [11] were on a Cray-1S, but they are essentially the same for the Cray X-MP.

Neither the performance of BANDR nor that of VBANDR was acceptable for the problems we were considering. Since BANDR seemed to be the likely candidate for further enhancements, we rewrote BANDR to use SROTG and SROT from the BLAS [13], which have very efficient implementations on the Cray X-MP in the Boeing computational kernel library VectorPak [17]. This modification to BANDR provided a factor of 10 speed-up over standard BANDR on problems with n > 500 and p = 50.

The tridiagonal eigenvalue problem is then solved with TQLRAT from EISPACK, which is known to be one of the fastest tridiagonal eigenvalue solvers. TREEQL, a recent successful implementation of Cuppen's [4] divide-and-conquer algorithm for the tridiagonal eigenvalue problem by Dongarra and Sorensen [5], promised some additional speed-up. However, TREEQL computes both eigenvalues and vectors. Because of the size of the eigenvector matrix, the computation of the full spectrum is neither desired nor feasible in our application. TREEQL was therefore modified to avoid the storage of all eigenvectors. Only intermediate quantities were computed, which are related to the eigenvectors of the tridiagonal submatrices and are required for the computation of the updates. This modified version of TREEQL required a maximum of 16n storage locations, compared to the n^2 words of memory required in the original TREEQL for the full eigenvector matrix. The modified TREEQL also performs fewer operations than the original TREEQL. Because the number of operations is determined by the number of deflations occurring in the algorithm, an operation count for the modified TREEQL is difficult to give. In some computational tests, however, it turned out that the modified TREEQL performed about the same as TQLRAT on matrices of order up to 1500 (for more details, see Section 7). Therefore, we decided not to include the modified TREEQL in the algorithm. These results were obtained using the single-processor version of TREEQL. We did not use TREEQL as a multiprocessor algorithm, where its advantages are more obvious.
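The in-core task itself, computing the eigenvalues of a symmetric band matrix held in band storage, can be exercised with LAPACK-based library routines; a brief sketch using scipy.linalg.eig_banded follows. This is not the modified BANDR/TQLRAT path described above, but it computes the same quantities; the band-storage helper and the random test matrix are ours.

import numpy as np
from scipy.linalg import eig_banded

def to_band_storage(Cb, p):
    # LAPACK lower band storage (diagonal and p sub-diagonals) of a full symmetric band matrix.
    n = Cb.shape[0]
    ab = np.zeros((p + 1, n))
    for d in range(p + 1):
        ab[d, :n - d] = np.diagonal(Cb, -d)
    return ab

rng = np.random.default_rng(6)
n, p = 400, 8
Cb = np.zeros((n, n))
for d in range(p + 1):                       # random symmetric band matrix
    v = rng.standard_normal(n - d)
    Cb += np.diag(v, -d) + (np.diag(v, d) if d else 0)

w = eig_banded(to_band_storage(Cb, p), lower=True, eigvals_only=True)
print(np.max(np.abs(np.sort(w) - np.sort(np.linalg.eigvalsh(Cb)))))   # agree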

5. COMPUTATION OF EIGENVECTORS

After the computation of all the eigenvalues, control is returned to the user. Thus the user can examine the computed eigenvalues and decide how many (if any) eigenvectors to compute. The last decision to be made in the development of this software package concerns the approach to be used for the computation of the selected eigenvectors.

The first approach was to compute the selected eigenvectors of the tridiagonal matrix using inverse iteration (subroutine TINVIT from EISPACK). This, however, requires the back-transformation of the eigenvectors using the n^2/2 Givens rotations generated by the modified version of BANDR. This seemed to be prohibitively expensive. The second approach was to compute the selected eigenvectors of the band matrix, also using inverse iteration. Subroutine BANDV of EISPACK performs this computation. We tried BANDV and found it to be very inefficient in both central memory requirements and central processing time.

An examination of EISPACK's BANDV shows that it first computes a pivoted LU factorization of the band matrix. Only U, which now has a bandwidth of 2p, is saved, and L is discarded. Since L is not available for a true implementation of inverse iteration, a heuristic, which uses back-solves with U, is used to compute the eigenvector. The convergence is apparently slow: EISPACK permits up to n iterations per eigenvector instead of the limit of 5 iterations allowed in other EISPACK subroutines using inverse iteration. BANDV was modified to compute an LDL^T factorization of the shifted matrix. The factorization is modeled after an efficient implementation of SPBFA from LINPACK [3] on the Cray X-MP based on the Level 2 BLAS kernel STRSV [6]. Also, the heuristic was replaced with true inverse iteration.

These two modifications give a 100-fold speed-up over the original implementation of BANDV on the Cray X-MP. One order of magnitude of speed-up is obtained from the replacement of the linear equation solution technique; the other is from the faster convergence of inverse iteration over the original heuristic. Two other by-products of the enhancements are a reduction of the storage required by BANDV to represent the shifted band matrix from 2np to np and, more importantly, more accurate eigenvectors.
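A minimal sketch of inverse iteration on a symmetric band matrix is given below; it uses a banded LU solve (scipy.linalg.solve_banded) for the shifted systems rather than the LDL^T factorization of the modified BANDV, and the shift offset, iteration limit, and names are ours.

import numpy as np
from scipy.linalg import solve_banded, eig_banded

def band_inverse_iteration(ab_lower, lam, p, iters=5, seed=0):
    # Inverse iteration for the eigenvector of a symmetric band matrix
    # (lower band storage ab_lower, half bandwidth p) nearest the shift lam.
    n = ab_lower.shape[1]
    ab = np.zeros((2 * p + 1, n))            # full (2p+1)-diagonal storage for solve_banded
    ab[p:, :] = ab_lower
    for d in range(1, p + 1):                # mirror sub-diagonals to super-diagonals
        ab[p - d, d:] = ab_lower[d, :n - d]
    ab[p, :] -= lam                          # shift the diagonal
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for _ in range(iters):
        x = solve_banded((p, p), ab, x)      # solve (C - lam I) y = x
        x /= np.linalg.norm(x)
    return x

rng = np.random.default_rng(7)
n, p = 200, 6
ab_lower = rng.standard_normal((p + 1, n))
for d in range(1, p + 1):
    ab_lower[d, n - d:] = 0.0                # zero the unused tail of each sub-diagonal
w = eig_banded(ab_lower, lower=True, eigvals_only=True)
x = band_inverse_iteration(ab_lower, w[0] + 1e-7, p)   # vector for the smallest eigenvalue
C = np.zeros((n, n))                                    # dense copy, only for the check
for d in range(p + 1):
    C += np.diag(ab_lower[d, :n - d], -d) + (np.diag(ab_lower[d, :n - d], d) if d else 0)
print(np.linalg.norm(C @ x - w[0] * x))                 # small residual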

The central memory requirement for computing the eigenvalues alone is about 3np words. To compute eigenvectors, 2np words of central memory are required for the band matrix and for the workspace for carrying out inverse iteration. This leaves storage for p eigenvectors to be held in central memory. Thus the eigenvectors are computed in blocks of at most p eigenvectors at a time.

Reorthogonalization of the eigenvectors is required for eigenvectors corresponding to multiple eigenvalues, and is often needed when the eigenvalues are close but not exactly equal. If the gap |λ_i - λ_{i+1}| is smaller than a fixed small multiple of ‖B‖_1, then the two neighboring eigenvalues λ_i and λ_{i+1} are considered to belong to a cluster. After all clusters have been determined, up to p eigenvectors are computed at a time, whereby all eigenvalues in a cluster are assigned to the same group of p eigenvalues. Then reorthogonalization is carried out within the cluster. Since all eigenvectors belonging to the cluster are in core at this time, no extra I/O costs are incurred.
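A sketch of this grouping strategy follows, with an arbitrary tolerance and with modified Gram-Schmidt standing in for the reorthogonalization; the tolerance, packing details, and all names are ours.

import numpy as np

def cluster_eigenvalues(w, tol):
    # Group sorted eigenvalues so that neighbours closer than tol share a cluster.
    clusters, start = [], 0
    for i in range(1, len(w)):
        if abs(w[i] - w[i - 1]) >= tol:
            clusters.append((start, i)); start = i
    clusters.append((start, len(w)))
    return clusters                       # half-open index ranges

def pack_clusters(clusters, p):
    # Pack whole clusters into groups of at most p eigenvalues each.
    groups, cur, size = [], [], 0
    for c in clusters:
        csize = c[1] - c[0]
        if size + csize > p and cur:
            groups.append(cur); cur, size = [], 0
        cur.append(c); size += csize
    if cur:
        groups.append(cur)
    return groups

def reorthogonalize(X, clusters):
    # Modified Gram-Schmidt within each cluster of columns of X.
    for a, b in clusters:
        for j in range(a, b):
            for k in range(a, j):
                X[:, j] -= (X[:, k] @ X[:, j]) * X[:, k]
            X[:, j] /= np.linalg.norm(X[:, j])
    return X

w = np.array([0.1, 0.1000001, 0.5, 0.9, 0.9000002, 0.9000003])
print(cluster_eigenvalues(w, 1e-5))                 # [(0, 2), (2, 3), (3, 6)]
print(pack_clusters(cluster_eigenvalues(w, 1e-5), p=4))

rng = np.random.default_rng(9)
X = np.linalg.qr(rng.standard_normal((40, len(w))))[0] + 0.01 * rng.standard_normal((40, len(w)))
X = reorthogonalize(X, cluster_eigenvalues(w, 1e-5))
print(abs(X[:, 0] @ X[:, 1]))                       # ~0 within the first cluster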

There are two possible drawbacks to this clustering scheme. One is that obviously we will not be able to compute accurate eigenvectors if the multiplicity or the cluster size is greater than the block size p. In our current setup the block size is very large (p = 63), and this is unlikely to occur. In the event that very high multiplicities are indeed found, the user still has the option to increase the block size and thus compute all the eigenvectors belonging to a cluster within one block.

The other possible difficulty arises when there is an inefficient allocation of clusters to groups of eigenvalues. A worst case is if there are about 2n/p clusters with p/2 + 1 eigenvalues each. In this case about twice as many groups of eigenvectors would be required as in the normal case. This increases the I/O requirements for back-transforming the eigenvectors. In other contexts [14], reorthogonalizations occupied a significant portion of the computation time. We did not find this to occur in our numerical experiments, which are reported in Section 7.

The algorithm to compute the eigenvectors of the band matrix and back-transform them to eigenvectors of the original problem is given below.

Algorithm 5.1. Computation of Eigenvectors

for ivec = 1, numvec, p
    Using BANDV, compute the eigenvectors of the band matrix corresponding to the
    next set of up to p eigenvalues specified by the user. The eigenvectors are stored in X.
    for k = m - 1, 1, -1
        Read in W and Y.
        X = (I - W Y^T) X.
    end for
    X = L^{-T} X.
    Write X to secondary storage.
end for
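A self-contained in-core sketch of this eigenvector path follows; explicit orthogonal factors Q_k stand in for the stored (W, Y) pairs, a dense symmetric eigensolver stands in for the band solver, and the transforms are kept in a Python list rather than on secondary storage. The final residual confirms that the back-transformed vectors are eigenvectors of the original pencil (1); all names and sizes are ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(8)
n, p = 96, 8
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)

L = cholesky(B, lower=True)
C = solve_triangular(L, solve_triangular(L, A, lower=True).T, lower=True).T

transforms = []                                    # plays the role of secondary storage
for b0 in range(0, n - p, p):                      # block reduction to band form
    b1 = b0 + p
    Q, _ = np.linalg.qr(C[b1:, b0:b1], mode='complete')
    C[b1:, :] = Q.T @ C[b1:, :]
    C[:, b1:] = C[:, b1:] @ Q
    transforms.append((b1, Q))

lam, Z = np.linalg.eigh(C)                         # eigenpairs of the (banded) matrix
X = Z[:, :10].copy()                               # back-transform a block of 10 vectors
for b1, Q in reversed(transforms):                 # k = m-1, ..., 1
    X[b1:, :] = Q @ X[b1:, :]
X = solve_triangular(L, X, trans='T', lower=True)  # X = L^{-T} X
print(np.max(np.abs(A @ X - (B @ X) * lam[:10])))  # small: eigenvectors of (1)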

6. RESOURCE REQUIREMENTS

The software described above makes extensive use of both the central processing unit and secondary storage of a computer. This section discusses the requirements of the algorithm for secondary storage, amount of I/O transfer between central memory and secondary storage, storage requirements for central memory, and the number of floating point operations.

Secondary storage for this algorithm is used to hold B, which is overwritten with L; to hold A, which is overwritten with C and later with the band matrix; and to hold the matrices W and Y. Approximately n^2/2 words of storage are used for each of the four matrices; thus the total amount of secondary storage required is approximately 2n^2. A Solid-state Storage Device (SSD) with 128 million words of secondary storage with fast access on a Cray X-MP would allow problems of order up to 8000 before overflowing the SSD to slower secondary storage on disks.

The amount of I/O transfer between central memory and secondary storage for the unit storing the matrices A, C, and the resulting band matrix is approximately (1/2)n^3/p real words. The I/O transfer for the unit storing L, the Cholesky factor of B, is n^3/p. The total amount of I/O transfer is thus approximately (3/2)n^3/p real words.

The maximum central memory requirement for the above algorithm is

2np + max(4p^2, np + 3n + p, np + n + p(p + 1)),

where the three terms correspond to the Cholesky factorization of B, the reduction to banded form, and the computation of the eigenvectors. A good working estimate for the amount of central memory required is 3np. For the size of problems being considered (e.g., n = 5000 and p = 50), this is less than 1 million words of working storage.
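Evaluating the estimate directly (a tiny helper of ours, using the formula quoted above):

def central_memory_words(n, p):
    # 2np + max(4p^2, np + 3n + p, np + n + p(p + 1)) words
    return 2 * n * p + max(4 * p * p, n * p + 3 * n + p, n * p + n + p * (p + 1))

print(central_memory_words(5000, 50))   # 765050 words, under 1 Mword and close to 3np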

The operation count for computing the eigenvalues of the eigenproblem in (1) by this approach is approximately (11/3)n^3 + 10n^2p + lower order terms. This operation count comes from analyzing the four major components of the algorithm. The operation counts for these components are as follows:

Component                               Operation count
Cholesky factorization                  (1/3)n^3
Reduction to standard form              2n^3
Block Householder transformations       (4/3)n^3 + 2n^2p
  QR factorizations                     2n^2p
Reduction of band matrix                6n^2p
Total                                   (11/3)n^3 + 10n^2p

For an execution with n = 1492 and p = 50, the monitored operation count was 13.3 x 10^9. The operation count computed by the above formula is 13.2 x 10^9, which is within 1% of the actual count.

The computational kernels used in the algorithm are Cholesky factorization, block solves with the Cholesky factor, QR factorization, matrix multiplication, and a banded eigenvalue solver. All of these kernels, except for those used in solving the banded eigenproblem, perform well on vector computers. In fact, they have computational rates in the 150 to 190 Mflop range on a Cray X-MP for the problem sizes being discussed in this application. The code is portable, and can be implemented efficiently on other vector supercomputers whenever high performance implementations of the Level 2 BLAS [6] are available.

7. PERFORMANCE RESULTS

After establishing that each component of the implementation of the above algorithm in HSSXGV and HSSXG1 was executing at the expected efficiency, a parameter study for the optimal choice of block size p was performed. A small problem of order n = 667 from [9] was chosen as the test problem. HSSXGV and HSSXG1 were executed with varying block sizes. Table I lists the results of the study. The block size was chosen such that mod(p, 4) = 3 to avoid bank conflicts on the Cray X-MP. The central processing time is in seconds, and the computational speed is reported in millions of floating point operations per second (MFLOPS). The time reported for HSSXG1 is the average time for computing one eigenvector. An optimal choice of block size is dependent on the number of eigenvectors to be computed. A choice of 63 for the block size is good on the Cray X-MP. Further experiments indicate that 63 is a good choice on larger problems as well. If a large number of eigenvectors is to be computed, a smaller block size might be in order.

Table II compares the combination of HSSXGV and HSSXG1 with the performance of the symmetric packed storage version of EISPACK and the block Lanczos algorithm described in [9] on the same eigenproblems as in that paper. HSSXGV and HSSXG1 used block size 63 for all problems. The statement of the eigenproblem is to compute all eigenvalues and eigenvectors in the interval [-1.0, 0.30].


Table I. Various Choices of Block Size for n = 667 (Execution Times in Seconds)

p     HSSXGV time   HSSXGV MFLOPS   HSSXG1 time   HSSXG1 MFLOPS
27    11.92         100.4           .040          48.0
39    11.08         111.8           .043          57.5
47    10.75         117.6           .048          60.8
59    10.66         122.1           .058          64.3
63    10.37         126.6           .061          66.1
67    11.31         117.3           .063          69.1
79    11.14         122.0           .073          74.5
87    11.02         125.3           .078          79.8
99    11.25         125.4           —             84.7

Table II. Comparison of Performance with In-Core Methods (Execution Times in Seconds)

n      Number of eigenvectors   EISPACK   HSSXGV and HSSXG1   Block Lanczos
219    22                       1.37      1.07                1.37
667    32                       14.73     12.33               8.25
992    35                       37.33     35.38               17.00
1496   35                       108.54    88.74               47.70

Table III. Comparison of HSSXGV versus Block Lanczos on Matrix of Order 992 (Execution Times in Seconds)

Number of eigenvectors   HSSXGV and HSSXG1   Block Lanczos
20                       33.77               21.20
40                       35.70               20.62
60                       37.78               27.02
80                       39.69               30.78
100                      41.32               39.35
120                      43.66               53.91
140                      45.58               62.35


HSSXGV and HSSXG1 not only allow larger problem sizes than the symmetric packed storage EISPACK path, but have an 18% performance increase because of the block nature of the computations. Block Lanczos is indeed faster than HSSXGV and HSSXG1 for these eigenproblems. However, if more eigenvectors are required, HSSXGV and HSSXG1 are more efficient than block Lanczos. For the problem of order n = 992, the figures in Table III indicate that HSSXGV and HSSXG1 are faster than the block Lanczos code if the number of required eigenvectors is increased to more than 100. We estimate that for the largest problem above, n = 1496, HSSXGV would become the algorithm of choice if 80 or more eigenvectors were required.

Table IV. Performance in MFLOPS

n      HSSXGV   HSSXG1
219    14       51
667    127      66
992    128      72
1496   161      80

Table V. Comparison of TREEQL and TQLRAT (Execution Times in Seconds)

n      TQLRAT   TREEQL   Number of levels   Number of deflations
219    0.75     0.93     4                  13
667    10.37    11.38    6                  135
992    31.97    33.87    7                  260
1496   83.08    85.28    7                  217

Also, HSSXGV computes all the eigenvalues and assumes no prior knowledge of the spectrum. Both Lanczos and the EISPACK path require an interval as input, and proceed to compute all the eigenvalues in the interval. Hence HSSXGV in Table II has computed additional information.

The overall computation rate for the problems in Table II is given in Table IV. The highest performance with 161 MFLOPS was obtained on the largest problem. This number is quite remarkable since a significant amount of computation in the tridiagonal eigensolver is carried out in scalar mode.

The overall change in performance when using TREEQL instead of TQLRAT can be seen from the results in Table V. For the runs in Table V, all eigenvalues, but no eigenvectors, were computed.

The number of levels and the number of deflations refer to quantities used in the implementation of TREEQL (see [5]); the numerical values of the computed eigenvalues agreed to within machine precision. TREEQL apparently offers no advantage in this context for a single-processor implementation.

These results in no way contradict the very favorable ones obtained with TREEQL elsewhere (e.g., in [5]). As mentioned before, the version of TREEQL used here has been modified to compute eigenvalues only. The modified version of TREEQL requires considerably less arithmetic than the original one, which computes eigenvectors as well. However, the modified version requires still more arithmetic than a comparable tridiagonal eigenvalue solver based on the QL algorithm alone. Hence, the roughly comparable execution times indicate that TREEQL would probably be more efficient if eigenvectors were required.

A second point worth mentioning relates to the number of deflations reported above. The number of deflations is apparently close to a worst-case behavior for TREEQL. This behavior usually has been observed for random matrices. One could argue therefore that the example chosen here is a worst-case example, and thus insufficient to dismiss TREEQL. On the other hand, one could also argue


Table VI. Performance on Large Random Matrices

n      Memory (Mwords)   CP time (sec)   Rate (Mflops)   Secondary storage (Mwords)   I/O wait, SSD (sec)   I/O wait, DD-49 (sec)
1000   0.40              29.8            141             2.16                         0.48                  290.09
2000   0.60              192.6           164             8.36                         4.62                  —
3000   0.79              579.7           180             18.61                        13.84                 —

Table VII. Performance on Large Sparse Matrices

n      CP time (sec)   Rate (Mflops)   Lanczos CP time (sec)
1919   98.9            116             83.1
1922   97.4            118             19.5
3562   514.5           131             8.43

that tridiagonal matrices obtained through a Householder or Givens reduction process are very unlikely to look like a tridiagonal matrix with entries (1, -2, 1),

which is, in some sense, a best case for TREEQL. Since there was no clear cut advantage in using TREEQL, we stayed with a routine that has for many years been proven to be reliable and efficient.

In order to demonstrate the overall performance of the software on some problems that are too large to fit into central memory of the Cray X-MP, we performed two additional series of tests. In the first series we computed all eigenvalues and 50 eigenvectors of random matrices of orders n = 1000, 2000, and 3000. The results are presented in Table VI.

Table VI demonstrates the excellent asymptotic performance that can be obtained with HSSXGV on the Cray X-MP, as well as its capability of efficiently solving very large out-of-core problems. The cost using DD-49 disks was about 50% higher than using the SSD for n = 1000. Also, the I/O wait time was an order of magnitude larger than the CP time. In contrast, the I/O wait time for the SSD was negligible. Therefore, the larger problems were not run using the DD-49 disks. For the same test runs, the eigenvector computation using HSSXG1 required 0.095, 0.226, and 0.389 seconds per eigenvector for the problems of size 1000, 2000, and 3000. The corresponding Mflops rates were 76, 91, and 103.

Finally, we tested the performance of HSSXEV on some large sparse problems. Here we did not want to utilize the sparsity, but only to test the out-of-core capability on some real problems with known answers. In order to do so, the sparse problems were unpacked and stored in the full upper triangle of the coefficient matrix. Then HSSXEV was applied to this matrix. For comparison, we list the execution time of a block Lanczos algorithm for sparse matrices. The Lanczos algorithm was only required to compute the eigenvalues to about 8 correct digits. The eigenvalues computed with HSSXEV agreed with the Lanczos results to this number of digits. It should be noted that the execution times in Table VII cannot and should not be compared as times for two competing algorithms. The execution times listed for HSSXEV are for the computation of all eigenvalues, but no eigenvectors. The execution times for block Lanczos are for the computation of 550, 200, and 10 eigenpairs, respectively.

8. SUMMARY

Software for the out-of-core solution of the symmetric generalized eigenproblem has been developed and implemented. Because of its block nature, this software is more efficient on vector computers than a related in-core algorithm. If the number of required eigenpairs is large, in particular in applications where all eigenvalues and eigenvectors are required, the new software is more efficient than a previous code of the authors [9] based on the Lanczos algorithm. The Lanczos algorithm remains the algorithm of choice for large sparse problems. Most importantly, the new software allows the efficient solution of problems too large to fit in central memory, thus providing an important computational tool to researchers in quantum mechanics and other disciplines that generate large symmetric generalized eigenproblems.

ACKNOWLEDGMENT

We would like to thank Dan Pierce for modifying TREEQL and for carrying out the related numerical experiments.

REFERENCES

1. ARMSTRONG, J. Optimization of Householder Transformations, Part I: Linear Least Squares. CONVEX Computer Corp., 701 N. Plano Rd., Richardson, TX 75081.

2. BISCHOF, C., AND VAN LOAN, C. The WY representation for products of Householder matrices. SIAM J. Sci. Stat. Comput. 8 (1987), s2-s13.

3. BUNCH, J., DONGARRA, J., MOLER, C., AND STEWART, G. LINPACK User's Guide. SIAM, Philadelphia, 1979.

4. CUPPEN, J. J. A divide and conquer method for the symmetric tridiagonal eigenvalue problem. Numer. Math. 36 (1981), 177-195.

5. DONGARRA, J. J., AND SORENSEN, D. C. A fully parallel algorithm for the symmetric eigenvalue problem. SIAM J. Sci. Stat. Comput. 8 (1987), s139-s154.

6. DONGARRA, J. J., Du CROZ, J., HAMMARLING, S., AND HANSON, R. Extended set of Fortran basic linear algebra subprograms. Argonne National Lab. Rep. ANL-MSC-TM-41 (Revision 3), 1986.

7. GARBOW, B. S., BOYLE, J. M., DONGARRA, J. J., AND MOLER, C. B. Matrix eigensystem routines-EISPACK guide extension. In Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977.

8. GRIMES, R. Solving systems of large dense linear equations. Rep. ETA-TR-44, Boeing Computer Services, Seattle, Wash., Feb. 1987; submitted to Supercomputing and Its Applications.

9. GRIMES, R., KRAKAUER, H., LEWIS, J., SIMON, H., AND WEI, S. The solution of large dense generalized eigenvalue problems on the Cray X-MP/24 with SSD. J. Comput. Phys. 69, 2 (1987), 471-481.

10. GRIMES, R., AND SIMON, H. Subroutines for the out-of-core solution of generalized symmetric eigenvalue problems. Rep. ETA-TR-54, Boeing Computer Services, 1987.

11. KAUFMAN, L. Banded eigenvalue solvers on vector machines. ACM Trans. Math. Softw. 10, 1 (1984), 73-86.

12. KAUFMAN, L., DONGARRA, J., AND HAMMARLING, S. Squeezing the most out of eigenvalue solvers on high-performance computers. Linear Algebra Appl. 77 (1986), 113-136.

13. LAWSON, C., HANSON, R., KINCAID, D., AND KROGH, F. Basic linear algebra subprograms for FORTRAN usage. ACM Trans. Math. Softw. 5 (1979), 308-321.


14. LO, S., PHILIPPE, B., AND SAMEH, A. A multiprocessor algorithm for the symmetric eigenvalue problem. SIAM J. Sci. Stat. Comput. 8 (1987), s155-s165.

15. PARLETT, B. N. The Symmetric Eigenvalue Problem. Prentice-Hall, Englewood Cliffs, N.J., 1980.

16. SMITH, B. T., BOYLE, J. M., DONGARRA, J. J., GARBOW, B. S., IKEBE, Y., KLEMA, V. C., AND MOLER, C. B. Matrix eigensystem routines-EISPACK guide. In Lecture Notes in Computer Science, Vol. 6, Springer-Verlag, Berlin, 1976.

17. VectorPak Users Manual. Boeing Computer Services Doc. 20460-0501-R1, 1987.

Received July 1987; accepted March 1988
