
Parallel Pseudospectral Electronic Structure: I. Hartree-Fock Calculations

DAVID CHASMAN, MICHAEL D. BEACHY, LIMIN WANG, RICHARD A. FRIESNER
Department of Chemistry, and Center for Biomolecular Simulation, Columbia University, New York, New York 10027

Received 23 December 1997; accepted 10 February 1998

ABSTRACT: We present an outline of the parallel implementation of our pseudospectral electronic structure program, Jaguar, including the algorithm and timings for the Hartree-Fock and analytic gradient portions of the program. We also present the parallel algorithm and timings for our Lanczos eigenvector refinement code and demonstrate that its performance is superior to the ScaLAPACK diagonalization routines. The overall efficiency of our code increases as the size of the calculation is increased, demonstrating actual as well as theoretical scalability. For our largest test system, alanine pentapeptide [818 basis functions in the cc-pVTZ(-f) basis set], our Fock matrix assembly procedure has an efficiency of nearly 90% on a 16-processor SP2 partition. The SCF portion for this case (including eigenvector refinement) has an overall efficiency of 87% on a partition of 8 processors and 74% on a partition of 16 processors. Finally, our parallel gradient calculations have a parallel efficiency of 84% on 8 processors for porphine (430 basis functions). © 1998 John Wiley & Sons, Inc. J Comput Chem 19: 1017-1029, 1998

Keywords: pseudospectral; parallel; Hartree-Fock; gradient; scalable

Correspondence to: R. A. Friesner
Contract/grant sponsor: National Science Foundation; contract/grant number: CHE9217368
Contract/grant sponsor: National Institutes of Health, Division of Research Resources; contract/grant number: P41RR06892
Contract/grant sponsor: MetaCenter
This article contains Supplementary Material available from the authors upon request or via the Internet at ftp.wiley.com/public/journals/jcc/suppmat/19/1017 or http://journals.wiley.com/jcc/

Journal of Computational Chemistry, Vol. 19, No. 9, 1017-1029 (1998)
© 1998 John Wiley & Sons, Inc. CCC 0192-8651/98/091017-13


Introduction

Parallelization of ab initio electronic structure methods is an essential task if one wishes to treat large molecules in a cost-effective and timely fashion. The latest generation of parallel machines, such as the IBM SP2, combines the low cost per processor of workstation technologies with the real-time throughput potential of a large supercomputer, when provided with efficient parallel algorithms. The system software, mathematical libraries, and communications hardware and software on these machines are now such that the parallel implementation of large, complex codes can be accomplished with a reasonable level of human effort. Additionally, the large local disk space of the SP2 renders it particularly suitable for quantum chemical computations, where files must invariably be read from disk for each iteration in problems of significant size.

In this study, we describe a parallel algorithm for Hartree-Fock (HF) calculations using pseudospectral (PS) numerical techniques as implemented in the Jaguar suite of ab initio electronic structure code.[1] Timing results for both single-point and analytical gradient calculations are presented on the IBM SP2. Our code is written using a standard MPI protocol, however, and is therefore readily portable to other parallel machines. While a number of parallel implementations of HF codes have appeared elsewhere,[2-9] the algorithms employed in the PS methodology are significantly different from those in conventional electronic structure approaches, and hence require a different parallelization strategy. We have shown in a previous publication[10] that PS HF calculations are substantially more efficient than alternative codes such as Gaussian-92 (by factors of three to seven, depending upon the basis set and hardware platform) for single-point energies of large systems. Gains are particularly large for high-quality basis sets such as the Dunning cc-pVTZ(-f) basis. Therefore, an effective parallel implementation of the PS HF algorithms in Jaguar will result in a powerful and practical tool that will extend the range of chemical systems that can be profitably studied by ab initio electronic structure methods.

A major objective in parallelizing computational chemistry codes is the achievement of scalability; that is, the ability to utilize an arbitrarily large number of processors to solve a given problem with a proportional reduction in wall-clock time. However, for Hartree-Fock calculations, there are fundamental limits on scalability that arise from irreducible aspects of the calculations, such as matrix multiplies and matrix diagonalizations, which are inherently quite difficult to parallelize efficiently. (That is true within the current technology. Methods have been proposed[11] that would do away with the need for diagonalization, although it has not been demonstrated that these methods are better in practice as well as in theory.) In addition to matrix multiplies and diagonalizations, there are practical challenges associated with achieving precise load balancing in assembly of the Coulomb and exchange operators. If one's Fock matrix assembly time is sufficiently large compared with these terms, then scalability can be demonstrated for a large number of processors, but this simply means that the original single-node code was highly inefficient. In contrast, our exceptionally efficient single-node performance means that Amdahl's law asserts itself much more quickly in the form of the terms previously mentioned. We have attacked a part of this problem via a parallel implementation of the Lanczos diagonalization algorithm.[12] This allows us to achieve an improvement of three to four times in this segment of the calculation as compared with the ordinarily employed alternative of ScaLAPACK.[13] However, this is insufficient to eliminate some degradation in performance at the four- to eight-node level for the problems we examine here.

Despite this difficulty, we do not believe that this observation has serious practical consequences. In our experience, the vast majority of computationally oriented chemical laboratories are at any one time engaged in the study of a large number of molecules that can be trivially distributed across a parallel platform simply by running multiple jobs. For this reason, it is not important, in a practical sense, to attain a theoretically perfect scalability for arbitrarily small problems. Rather, what is crucial about parallelization is the ability to tackle large molecules with an acceptable turnaround time, while not sacrificing efficiency. This leads us to define "practical scalability" in a different manner: We consider our approach to possess practical scalability if, for a given problem size, the number of nodes that can be used efficiently (say, with 85-90% of the theoretical maximum performance) is such that the wall-clock time for the job is reduced to an acceptable level (e.g., a few hours for a single-point calculation, a few days for a geometry optimization). Our current implementation achieves this objective, although quantitative improvements are still possible.

We have organized this study as follows: We first present a brief overview of the pseudospectral Hartree-Fock methodology to facilitate the discussion of its parallelization. We then describe in detail the algorithms we have developed for parallelizing different parts of the code. Results are then presented using two basis sets [6-31G** and Dunning cc-pVTZ(-f)] for several single-point calculations, displaying the scaling of wall-clock time with the number of processors. Results are also presented for gradient calculations using the 6-31G** basis. Finally, we summarize the results and suggest future directions.

Pseudospectral Overview

The pseudospectral assembly of Coulomb and exchange matrix elements has previously been described in several studies,[14-20] and we shall simply present the formulas here without elaboration. The Coulomb matrix element between basis functions |i> and |j> is:

J_{ij} = \sum_g Q_i(g) J(g) R_j(g)    (1)

where the physical-space Coulomb operator, J(g), is given by:

J(g) = \sum_{kl} A_{kl}(g) \rho_{kl}    (2)

Here, A_kl(g) is a three-center, one-electron (potential) integral representing the field at grid point g due to the product charge distribution of basis functions R_k and R_l, and is given by:

A_{kl}(g) = \int \frac{R_k(g') R_l(g')}{|g - g'|} \, d^3 g'    (3)

In the formulas just expressed, R_j(g) is the atomic basis function j evaluated at a grid point g, ρ_kl is the usual density matrix, and Q_i(g) is a least squares fitting operator:

Q_{ig} = \sum_j S_{ij} \left\{ [R^{\dagger} w R]^{-1} R^{\dagger} w \right\}_{jg}    (4)

which is designed to fit any right-hand side R_j(g) A_kl(g) in the region of space in which atomic basis function |i> has significant density. Here, the matrix w is the diagonal matrix of grid weights. It has been shown previously[10] that the use of a specialized least squares operator with an optimized fitting basis is considerably more accurate for a given number of grid points than any standardized quadrature scheme. Our most accurate grid for the 6-31G** basis uses 300-400 grid points per atom, whereas our most inexpensive grid uses only 100 grid points per atom.

The exchange matrix elements K_ij are given by:

K_{ij} = \sum_g Q_i(g) K_j(g)    (5)

where:

K_j(g) = \sum_n A_{jn}(g) \sigma_n(g)    (6)

is the pseudospectral physical-space exchange field. Here, the intermediate quantity σ_n(g) is defined as:

\sigma_n(g) = \sum_m \rho_{nm} R_m(g)    (7)

When |i> is a long-range function, the least squares fit has to be carried out over a large portion of the molecule. This leads to a significant loss of accuracy for a fixed number of grid points as well as an increased effort in solving the normal equations. This problem has been addressed with a length scales algorithm, which is described in detail in ref. 10. Briefly, the idea is to avoid matrix element calculations in which a diffuse function must be used in the least squares operator. When at least one function is nondiffuse, this can be trivially accomplished by permuting the i and j indices so the diffuse function is on the right. If both |i> and |j> are diffuse functions, more work is required, but it is possible to express both the Coulomb and exchange matrix elements so that a diffuse least squares operator is used only if all three functions following it are also diffuse.
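To make the flow of eqs. (1)-(7) concrete, the following Python/NumPy sketch evaluates the pseudospectral Coulomb and exchange matrices from precomputed arrays Q, R, A, and rho on a toy grid. The array names, shapes, and random data are illustrative assumptions for this sketch only; they are not the Jaguar data structures.

    import numpy as np

    nbf, ngrid = 6, 40                          # toy numbers of basis functions and grid points
    rng = np.random.default_rng(0)

    R   = rng.normal(size=(ngrid, nbf))         # R_j(g): basis functions evaluated on the grid
    Q   = rng.normal(size=(nbf, ngrid))         # Q_i(g): least squares operator of eq. (4)
    A   = rng.normal(size=(ngrid, nbf, nbf))    # A_kl(g): three-center one-electron integrals, eq. (3)
    A   = 0.5 * (A + A.transpose(0, 2, 1))      # symmetric in k, l
    rho = rng.normal(size=(nbf, nbf))
    rho = 0.5 * (rho + rho.T)                   # density matrix

    # Coulomb: J(g) = sum_kl A_kl(g) rho_kl, eq. (2); J_ij = sum_g Q_i(g) J(g) R_j(g), eq. (1)
    J_g = np.einsum('gkl,kl->g', A, rho)
    J   = np.einsum('ig,g,gj->ij', Q, J_g, R)

    # Exchange: sigma_n(g) = sum_m rho_nm R_m(g), eq. (7);
    # K_j(g) = sum_n A_jn(g) sigma_n(g), eq. (6); K_ij = sum_g Q_i(g) K_j(g), eq. (5)
    sigma = np.einsum('nm,gm->gn', rho, R)
    K_g   = np.einsum('gjn,gn->gj', A, sigma)
    K     = np.einsum('ig,gj->ij', Q, K_g)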

Parallelization Scheme

There are five major parts of the pseudospectral Hartree-Fock algorithm that require parallelization. These are: (a) the generation of the least squares operators; (b) calculations of the one-, two-, and three-center, two-electron integrals via our GHGP code[21, 22]; (c) calculation of the three-center, one-electron integrals; (d) the pseudospectral assembly [eqs. (5) and (1)]; and (e) eigenvector refinement. We will present our parallelization scheme in the order in which the computational tasks are carried out in the program.

LEAST SQUARES ASSEMBLY

Before the SCF iterations begin, our least squares code assembles the least squares operator Q of eq. (4), which transforms from physical space to spectral space. In practice, not all elements of the Q matrix are actually computed, because most basis functions drop off sharply as a function of the distance from their centers. Such functions are classified as short-range functions and are grouped together by atom. The remaining functions are classified as long-range functions, which are placed in a single group. The least squares operator is evaluated individually for each of the N_atom + 1 groups. The N_atom short-range fitting matrices are evaluated for the grid points that are within an adjustable cutoff distance of their atomic centers. The long-range fitting matrix is evaluated across all of the grid points. This decomposition of Q allows us to only consider small regions of space for the short-range basis functions. Further, as the system size is increased, it will be possible to employ cutoff distances for even the long-range functions. This implies that the evaluation of the least squares operators, Q_i, scales as N log N with system size, rather than N^2. The length scales algorithm also improves the accuracy obtained with a given number of grid points.[20]

The set of grid points associated with each atom is broken down into spatially contiguous groups (termed grid blocks), and the evaluation of the matrix elements from the short-range functions then takes place over nested groups of grid types, atoms, grid blocks, and, finally, grid points. First, the matrix elements:

[R^{\dagger} w R]_{ij} = \sum_g R_i(g) w_g R_j(g)    (8)

are evaluated for a particular atom and grid type. Here, the indices i and j indicate fitting functions centered on the atom for which the matrix is being evaluated, and g is the grid point index. The cost of this matrix multiplication is N_fit^2 N_grid, where N_grid is the total number of grid points and N_fit is the number of fitting functions associated with the atom for which [R†wR] is being evaluated. (The number of fitting functions for a given atom is generally around two to four times the number of basis functions on that atom, depending on the specific basis set and accuracy of the grid being used.) Once the [R†wR] matrix for a given atom and grid resolution has been assembled, each of the corresponding matrices S[R†wR]^{-1} [see eq. (4)] is formed using S and [R†wR] by solving for X in:

[R^{\dagger} w R] X = S    (9)

Solving this equation by Cholesky decomposition requires on the order of N_fit^3 operations. Each Cholesky decomposition is carried out on the node which computed that particular [R†wR] matrix. The formation of the matrices S[R†wR]^{-1} for each grid resolution is distributed in the N_atom + 1 tasks comprised of both the matrix multiplication [eq. (8)] and the Cholesky decomposition [eq. (9)]. After all S[R†wR]^{-1} matrices are completed, the results are broadcast to all the nodes. The ratio of computation to communication is:

\frac{N_{fit}^2 N_{grid}}{N_{fit}^2 \log(N_{proc})}    (10)

Because this ratio is proportional to the number of points in the molecular grid over the log of the number of processors, we can rest assured that this parallelization scheme will be successful.
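The assembly of the least squares blocks can be sketched as follows for a single atom group: eq. (8) is a weighted matrix product, and eq. (9) is solved by a Cholesky decomposition. This is a minimal Python/NumPy/SciPy illustration; the dimensions, the random data, and the use of an identity stand-in for the overlap block S are assumptions of the sketch.

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    nfit, npts = 12, 300                       # fitting functions and grid points for this atom
    rng = np.random.default_rng(1)
    R = rng.normal(size=(npts, nfit))          # fitting functions on the atom's grid points
    w = rng.uniform(0.1, 1.0, size=npts)       # diagonal grid weights
    S = np.eye(nfit)                           # stand-in for the overlap block in eq. (9)

    RwR = R.T @ (w[:, None] * R)               # eq. (8): [R†wR]_ij = sum_g R_i(g) w_g R_j(g)
    c, low = cho_factor(RwR)                   # Cholesky decomposition, O(nfit^3)
    X = cho_solve((c, low), S)                 # eq. (9): [R†wR] X = S

    # The node that built RwR would then finish its block of Q by multiplying with R†w.
    Q_block = X @ (R.T * w)                    # shape (nfit, npts)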

Next, the final Q matrix is assembled (i.e., the matrix S[R†wR]^{-1} is multiplied by the matrix R†w). These tasks can be distributed by grid blocks. The computational work in this final matrix multiplication is proportional to N_fit^2 N_grid and the information that needs to be returned for each Q is N_fit N_grid, so the ratio of computation to communication is proportional to N_fit, the number of fitting functions associated with each atom or the long-range group. Because this ratio is roughly constant, and the number N_fit may be fairly small, there is no guarantee that the distribution of this final matrix multiplication will be practical. There are, however, a number of options available for this matrix multiplication, depending on the load balance method being used in the pseudospectral Fock matrix assembly. If a static load balance is being used, each processor simply builds Q for the grid blocks it has been assigned. For this method, there is no communication cost. For a partially dynamic assignment of grid blocks, each processor builds Q for its statically assigned grid blocks and all dynamically assigned grid blocks. In this case, the communication penalty is as noted earlier. One significant gain in either wholly static or partially static assignment of grid blocks is in the distribution of storage of the final Q operators, which have total size N_basis N_grid and are the largest single stored quantity in our PS HF implementation. It is also possible to run a calculation with completely dynamic assignment of grid blocks, for which the full Q operators are stored on each node. However, all calculations presented here have used partially dynamic grid block assignment.

MATRIX ASSEMBLY

After the least squares program has generated the matrices Q, any standard SCF procedure may be employed to obtain an eigenfunction of the molecular Hamiltonian. The only modification of our method is that the pseudospectral method is used in the construction of the relevant operators.

First, the analytical corrections are computed (see the "Analytical Corrections" subsection). The calculation of the analytical corrections is decomposed using columnar "strips" of the density matrix and may be distributed across the processors either statically or dynamically. Next, the Coulomb and exchange matrices are assembled using eqs. (1) and (5). This work is broken into grid block tasks, which correspond to groupings of the index g in eqs. (1) and (5). This grid block work is either distributed statically, dynamically (using the pool of tasks paradigm), or partially dynamically. The partially dynamic or completely static methods win out as the size of the molecule grows for two reasons. First, storage of the Q operators is not duplicated. Second, the Q operators need not be broadcast to all nodes. For small- to medium-sized molecules, such as porphine, these considerations are not important and the calculation can proceed with dynamic assignment of all grid blocks and analytic strips. We recognize that dynamic assignment of grid blocks through a shared counter is not a truly scalable solution to the load balance problem, but it suffices for our stated goal of reducing cycle times to reasonable levels for large problems.

Because there is no interdependency of the analytical corrections and the pseudospectral assembly work, the two tasks can be overlapped in the dynamic grid block assignment methods. That is, as soon as all of the "strips" of analytical corrections have been completed, the idle processors immediately begin the assembly by grid blocks, as is indicated in Figure 1. To improve load balance during operator assembly, we have varied the size of the grid blocks so that a variable fraction of the grid blocks is no more than half the size of the others, and we have sorted the grid blocks in descending order of size. This is similar to the reordering of tasks by size done by Lüthi et al.[23] to improve load balance in their SCF parallelization. These two minor modifications are analogous to building up a level wall with stones of varying sizes. By laying down the largest stones first, and then laying down smaller stones, it is more probable that a level wall will be obtained. This point is illustrated in Figure 1.

FIGURE 1. Task distribution for a partition of 4 processors, carrying out an SCF calculation on a molecule which requires 8 strips of analytical corrections and 20 grid blocks.
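The descending-size ordering described above can be illustrated with a small largest-first assignment sketch in Python. The block sizes, the helper name, and the greedy assignment rule are illustrative assumptions; the actual code sorts the blocks and hands them out statically or through a shared counter.

    import heapq

    def assign_blocks(block_sizes, nproc):
        """Assign grid blocks to processors, largest first, always to the least-loaded node."""
        loads = [(0.0, p, []) for p in range(nproc)]        # (current load, proc id, assigned blocks)
        heapq.heapify(loads)
        for block, size in sorted(enumerate(block_sizes), key=lambda t: -t[1]):
            load, p, tasks = heapq.heappop(loads)
            tasks.append(block)
            heapq.heappush(loads, (load + size, p, tasks))
        return {p: tasks for _, p, tasks in loads}

    # Example: 20 grid blocks, a fraction of them half-sized, distributed over 4 processors.
    sizes = [100] * 14 + [50] * 6
    print(assign_blocks(sizes, 4))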

The final Fock, exchange, and Coulomb matrices are assembled by simply carrying out a global sum of the local copies of the matrices. The local copies of the matrices are identical to the full pseudospectral matrices except that the subscript g in eqs. (1) and (5) has been restricted to the grid blocks that have been allocated to a particular processor. The local replication of these N_basis^2 matrices is the only limitation imposed by storage (as opposed to CPU) considerations in our code. (Because all of the data associated with grid points is distributed over processors, it does not constitute a barrier to the size of the system that can be treated.) However, the code is written so that these arrays can be stored on disk and read into the program in a blocked fashion, so that the penalty in wall-clock time is minimal.

Thus, the number of basis functions that can be treated is determined by what can fit on a local disk. The SP2 that we have at Columbia currently has a 2.5-GB local disk, and more recent machines are typically equipped with 9-GB or even 16-GB disks. A symmetric 5000 × 5000 matrix requires about 95 MB of storage, so all of the N_basis^2 matrices should easily fit into a gigabyte of storage. With a 9-GB disk drive, a 12,000-basis-function calculation becomes quite feasible. This calculation is significantly larger than any that we are contemplating in the near future (surely such a system is better treated via methods such as mixed quantum mechanics/molecular mechanics), and hence this aspect of the implementation is not as significant in a practical sense. And, although the algorithm presented here uses replicated data structures for practical reasons (ease of implementation), there is no reason that a code with distributed N_basis^2 matrices could not be developed.
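As a quick check on the storage figure quoted above, the lower triangle of a symmetric 5000 × 5000 matrix stored in double precision occupies

\frac{5000 \times 5001}{2} \times 8 \text{ bytes} \approx 1.0 \times 10^{8} \text{ bytes} \approx 95 \text{ MB}.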

Once the assembly of the Fock matrix is complete, the eigenvectors are refined using either a serial matrix diagonalization, the ScaLAPACK[13] parallel diagonalization (specifically, subroutine pdsyev, available from Netlib), or the parallel Lanczos-based procedure (see the "Eigenvector Refinement" subsection). The SCF procedure is then iterated to convergence. All N_basis × N_basis matrix multiplies are completed with the ScaLAPACK pdgemm subroutine.
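The global sum itself can be sketched with mpi4py, which is used here purely for illustration (the production code calls MPI directly): each node accumulates its grid-block and analytic-strip contributions into a local matrix, and a single Allreduce leaves the complete operator on every node. The matrix dimension is taken from the alanine pentapeptide example and is otherwise arbitrary.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    nbasis = 818                                   # e.g., alanine pentapeptide in cc-pVTZ(-f)
    F_local = np.zeros((nbasis, nbasis))           # this node's partial Fock matrix

    # ... accumulate this node's grid blocks and analytic-correction strips into F_local ...

    comm.Allreduce(MPI.IN_PLACE, F_local, op=MPI.SUM)   # every node now holds the full matrix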

Analytical Corrections

As described in the "Pseudospectral Overview" section, pJaguar (parallel Jaguar) calculates analytic corrections for the one-center and two-center spectral space terms of the Fock matrix, and for some of the three-center terms as well. Specifically, for the Coulomb matrix elements, we calculate the analytic terms:

\sum_{kl} (ij|kl) \rho_{kl}    (11)

for cases in which i, j, k, and l meet certain cutoff criteria and the term (ij|kl) is of the form (aa|aa), (aa|ab), (aa|bb), (ab|ab), or (aa|bc), where a, b, and c indicate the atom upon which the function is centered. Similar correction terms are computed for the exchange operator, as detailed in ref. 10. The corresponding pseudospectral terms in eqs. (1) and (5) for the appropriate i, j, k, and l are subtracted from the pseudospectral elements of J and K.

The calculation of the analytical corrections is decomposed into tasks by breaking the density matrix into columnar strips. In addition, the second (column) index is constrained so that all of the basis functions associated with a given atom are contained in a single strip. The restriction that the basis functions associated with a given atom are not allowed to cross a "strip boundary" has the advantage that all terms of the form (aa|bc) ρ_bc require only the matrix elements corresponding to atom b, whereas the terms of the form (aa|bc) ρ_aa require only the diagonal block ρ_aa. This implies that the only density, Coulomb, and exchange matrix entries that are accessed for the calculation of the analytical corrections for a given strip are those contained in the strip itself and those contained in the atom diagonal blocks of the matrix. The subtraction of the pseudospectral correction terms is distributed over the grid blocks, which will be discussed in the next subsection.

The analytical correction terms are evaluated before the Coulomb and exchange matrix elements. Because the analytical corrections are simply added to the final summation of J and K or F, it is unnecessary to sum these terms until the full pseudospectral assembly of the relevant operators has been completed. This allows us to assign the available processors to pseudospectral assembly tasks (described in the next section) as soon as the analytical correction tasks have been completed, without the need for a global synchronization operation. Thus, we are able to overlap the computation of the analytical corrections and the pseudospectral assembly. For the static and partially dynamic grid block assignment methods, the analytic corrections are statically assigned to nodes. The number of strips is set to be equal to the number of processors, or some multiple of it. To achieve load balance, the division of strips is done as evenly as possible, within the constraint that functions on a given atomic center remain in the same strip.
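A minimal sketch of the strip decomposition just described: the columns of the density matrix are grouped into roughly equal strips, with the constraint that all basis functions on one atom land in the same strip. The per-atom basis-function counts and the helper name are illustrative assumptions.

    def make_strips(nbf_per_atom, nstrips):
        """Group atoms into nstrips column strips of roughly equal basis-function count."""
        total = sum(nbf_per_atom)
        target = total / nstrips
        strips, current, size = [], [], 0
        for atom, n in enumerate(nbf_per_atom):
            current.append(atom)
            size += n
            if size >= target and len(strips) < nstrips - 1:
                strips.append(current)
                current, size = [], 0
        if current:
            strips.append(current)      # the last strip takes whatever remains
        return strips                   # each entry: the atoms whose functions form one strip

    print(make_strips([9, 9, 5, 5, 15, 15, 9, 5], nstrips=4))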

The communication costs associated with the analytical corrections are negligible for dynamic grid block assignment, because the only information that needs to be conveyed to the processors is the strip assignment index. For static and partially dynamic grid block assignment, there is no communication cost for the analytical corrections, as they are preassigned strips. The cost of communicating the per-processor contributions to the matrices J and K or F may be ignored because the local copy of these matrices will be incremented and will only be globally summed when the pseudospectral assembly step has been completed.

Pseudospectral Assembly

The first step in each SCF iteration is the evaluation of eq. (7) using the eigenvectors from either the initial guess or the previous iteration. This term is used in the assembly of the exchange operator, as described in eq. (5). Next, the three-center, two-electron terms are calculated for each grid point and for each pair of basis functions k and l, and these terms are used to compute:

J(g) = \sum_{kl} A_{kl}(g) \rho_{kl}    (12)

and:

K_j(g) = \sum_n A_{jn}(g) \sigma_n(g)    (13)

which are required to evaluate eqs. (1) and (6). Assembling the contributions from a given block's grid points to the Coulomb matrix element is simply a matter of multiplying J(g) by R_j(g) and then multiplying the result on the left-hand side by Q_i(g). Finally, the contributions must be summed over the grid points for the grid block. Similarly, the grid block's contribution to the exchange matrix element [eq. (5)] is evaluated by multiplying Q_i(g) by the exchange field just computed and summing the resulting terms over all grid points in the grid block.

As the Coulomb and exchange contributions from each grid block are evaluated, they are stored locally until all of the grid blocks have been completed. At this point a global sum of the local matrices is done. The cost of collecting any of the final matrices is proportional to N_basis^2 log(N_proc), and the computational work necessary to assemble the Fock operator is roughly N_basis^2 N_grid. Therefore, the ratio of computation to communication is N_grid / log(N_proc). This implies that our strategy requires that N_grid be significantly larger than the logarithm of the number of processors. This is an extremely modest requirement, as the number of grid points is anywhere from 100 to 400 per atom.
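The grid-block decomposition of eq. (1) amounts to restricting the grid index g to one block and accumulating the result locally. A minimal Python/NumPy sketch, reusing the array names and shapes of the earlier overview sketch (illustrative assumptions, not the Jaguar code):

    import numpy as np

    def coulomb_block_contribution(Q, R, A, rho, block):
        """Partial sum of eq. (1) over the grid points in one grid block."""
        J_g = np.einsum('gkl,kl->g', A[block], rho)                   # eq. (12) on this block's points
        return np.einsum('ig,g,gj->ij', Q[:, block], J_g, R[block])   # block's contribution to J

    # Each processor adds such contributions into its local copy of J; the global sum over
    # processors described earlier then completes the matrix.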

Eigenvector Refinement

The majority of the remaining time in our calculations is taken up by the refinement of the eigenvectors of the Fock matrix, most of which is spent in diagonalization. Because diagonalization with standard QR packages scales as N^3 while pseudospectral Fock matrix assembly scales asymptotically as N^2, pJaguar needs an efficient and well-distributed diagonalization procedure to solve very large electronic structure problems. The serial program Jaguar includes a Krylov-space diagonalization method, which uses the Lanczos algorithm to refine eigenvectors describing occupied orbitals.[12] This technique allows us to use the eigenvectors of the previous SCF iteration as a good approximation to the new eigenvectors. This Krylov-space method allows us to carry out our diagonalizations in a space that is itself a subspace of the occupied subspace. Virtual orbital character is included through the Lanczos algorithm, as described in what follows. The Fock matrix diagonalization for the first SCF iteration is completed with the ScaLAPACK diagonalization routine, as good initial guesses for eigenvectors are needed for the Lanczos algorithm.

In pJaguar, the Lanczos method affords us an important advantage in parallelization: each eigenvector is refined independently of the others, leading naturally to a reasonable parallelization scheme. Although timing results show that overall speedups resulting from parallelization of the eigensolver are small for our test cases, they clearly indicate the feasibility of the approach. As we move to larger problems, we expect that the benefits of the eigensolver parallelization will become apparent.

Our Lanczos diagonalization procedure begins with the construction and diagonalization of the Fock matrix in the space of the N_occ occupied orbitals obtained from either our initial guess or the previous SCF iteration. The transformation of the orbitals to the occupied subspace requires on the order of N^2 N_occ operations and the diagonalization in the occupied subspace requires on the order of N_occ^3 operations. Because the number of occupied orbitals is typically significantly less than the number of basis functions, N_basis, this step, which is carried out on a single node, is considerably less time consuming than the diagonalization of the full Fock matrix. At this point in time, this operation has not been parallelized.

Next, we refine each vector independently using the Lanczos algorithm described in ref. 12, again using the pool of tasks paradigm. Specifically, we construct the Krylov-space representation in the tridiagonal Lanczos form, using the relations:

u_{j+1} = G w_j
a_j = w_j^T u_{j+1}
u'_{j+1} = u_{j+1} - a_j w_j - b_j w_{j-1}
b_{j+1} = (u'^T_{j+1} u'_{j+1})^{1/2}
w_{j+1} = u'_{j+1} / b_{j+1}    (14)

where a_j and b_j are the entries in the tridiagonal matrix, w_0 is the initial guess vector, and G is the occupied subspace representation of the Fock matrix. As the Krylov space is extended one dimension at a time using eq. (14), the Krylov-space representation of the Fock matrix, which contains the parts of the overall vector space of F that are most strongly coupled to the original vector (including couplings to the virtual orbitals), is diagonalized, and the eigenvector that correlates most strongly with the initial guess vector, w_0, is monitored for convergence. Once we have obtained the new set of approximate eigenvectors, they are orthogonalized using the Gram-Schmidt procedure and the Fock matrix is rediagonalized in the space of the refined eigenvectors.[12] This procedure is repeated until the eigenvectors are converged. Because the individual Krylov-space refinements [eq. (14)] of the eigenvectors require only that the original eigenvector be sent to each slave and then returned, and the computational work required for each of the refinements is dominated by the matrix multiplications on the first line of eq. (14), we see that the ratio of computation to communication will be proportional to N_occ^2 / N_occ = N_occ, the number of occupied orbitals. Thus, as the number of occupied orbitals increases with system size, the distribution of the Krylov-space-based eigenvector refinement will become more efficient. Currently, the reorthogonalization/rediagonalization is carried out on a single node.
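The refinement of a single vector by the recursion of eq. (14) can be sketched in Python/NumPy as follows: build a small tridiagonal Krylov-space matrix from an approximate eigenvector w0 of a symmetric matrix G, diagonalize it, and return the Krylov eigenvector with the largest overlap with w0. This is an illustration of the algorithm only; the fixed Krylov dimension and the lack of a convergence test or reorthogonalization are simplifications relative to the procedure of ref. 12.

    import numpy as np

    def lanczos_refine(G, w0, m=8):
        """Refine one approximate eigenvector of the symmetric matrix G via eq. (14)."""
        n = G.shape[0]
        W = np.zeros((m, n))                   # Lanczos vectors w_j
        a = np.zeros(m)                        # tridiagonal diagonal entries a_j
        b = np.zeros(m)                        # tridiagonal off-diagonal entries b_j
        W[0] = w0 / np.linalg.norm(w0)
        for j in range(m):
            u = G @ W[j]                       # u_{j+1} = G w_j
            a[j] = W[j] @ u                    # a_j = w_j^T u_{j+1}
            u -= a[j] * W[j]
            if j > 0:
                u -= b[j] * W[j - 1]           # subtract b_j w_{j-1}
            if j + 1 < m:
                b[j + 1] = np.linalg.norm(u)   # b_{j+1} = (u'^T u')^{1/2}; assumed nonzero here
                W[j + 1] = u / b[j + 1]        # w_{j+1} = u' / b_{j+1}
        T = np.diag(a) + np.diag(b[1:], 1) + np.diag(b[1:], -1)   # tridiagonal Krylov-space matrix
        evals, evecs = np.linalg.eigh(T)
        k = np.argmax(np.abs(evecs[0]))        # eigenvector correlating most strongly with w0
        return W.T @ evecs[:, k], evals[k]     # refined vector in the original space, eigenvalue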

GRADIENT CALCULATIONS

The pseudospectral method for Hartree-Fock gradient calculations is described elsewhere.[24] Here, we briefly sketch the procedure previously reported. The derivative of the Hartree-Fock energy with respect to the nuclear coordinate x_A is:

\frac{dE_{HF}}{dx_A} = \frac{dV_{nuc}}{dx_A} + 2 \sum_{ij} \rho_{ij} \frac{dH^0_{ij}}{dx_A} - 2 \sum_{ij} W_{ij} \frac{dS_{ij}}{dx_A} + \sum_{ijkl} \rho_{ij} \rho_{kl} \frac{d[2(ij|kl) - (ik|jl)]}{dx_A}    (15)

where S is the overlap matrix, W is the energy-weighted density matrix, W_ij = Σ_a e_a c_ia c_ja, and e_a corresponds to the orbital energy of orbital a. The first three terms are simple to evaluate. The final term:

\sum_{ijkl} \rho_{ij} \rho_{kl} \frac{d[2(ij|kl) - (ik|jl)]}{dx_A}    (16)

involves the derivatives of two-electron integrals. These can be expanded as:

\frac{d(ij|kl)}{dx} = (i^x j|kl) + (i j^x|kl) + (ij|k^x l) + (ij|k l^x)    (17)

The various terms of eq. (16) may be rewritten as:

2 \sum_{lk} \rho_{lk} \sum_g Q_l^x(g) R_k(g) \sum_{ij} \rho_{ij} A_{ij}(g)
  = 2 \sum_{lk} \rho_{lk} \sum_g Q_l^x(g) R_k(g) J(g)
  = 2 \sum_{lk} \rho_{lk} J^x_{lk}    (18)

Here, we have used the definitions of J(g) and J_ij, which appear in eqs. (2) and (1), respectively. We note that the essential difference between J^x_lk in eq. (18) and J_ij in eq. (1) is that Q_l(g) has been replaced by Q_l^x(g), which corresponds to the least squares operator for the derivative function. Although some details have been omitted (e.g., the use of selected analytical integrals to decrease the necessary grid density), it is clear that the decomposition that we applied to our Fock matrix assembly can be applied to our gradient calculations. The calculation of the two-electron integrals has in fact been decomposed into analytical correction "strips" and grid block assembly in exactly the same manner that we decomposed our Fock matrix assembly. Fortunately, the computation of the analytical derivatives is constituted almost exclusively by the "strip" correction and grid block assembly, which can both be carried out with high parallel efficiency. This is reflected in our timings, which demonstrate a high efficiency even for the smallest molecule used for testing our SCF code.
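The contraction of eq. (18) reuses the physical-space Coulomb field J(g) already formed during the Fock build, with the derivative least squares operator Q^x in place of Q. A minimal Python/NumPy sketch, with array names and shapes following the earlier sketches (illustrative assumptions):

    import numpy as np

    def coulomb_gradient_term(Qx, R, J_g, rho):
        """Return 2 * sum_lk rho_lk J^x_lk, with J^x_lk = sum_g Q^x_l(g) R_k(g) J(g), eq. (18)."""
        Jx = np.einsum('lg,g,gk->lk', Qx, J_g, R)
        return 2.0 * np.einsum('lk,lk->', rho, Jx)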

Results and Discussion

In Tables I-V and Figures 2-6, we present results obtained for our code developed using MPL/MPI on the Cornell Theory Center SP2. The timings presented here were for runs conducted on the Columbia Chemistry/Lamont Doherty Geophysical Observatory SP2 on a uniform set of 256-MB thin nodes.

We note first that, in the present implementation, parallelization of the least squares fitting routine is not particularly efficient. However, this constitutes a small fraction of the CPU time, and hence becomes important only for a large number of processors. In contrast, parallelization of Fock matrix assembly is the most efficient aspect of the calculation. Timings for the entire SCF routine are not as scalable as the matrix assembly portion, reflecting the small utility routines that are difficult to parallelize.

We note very good speedup of the matrix assembly as a function of the number of processors.

TABLE I. Hartree-Fock SCF Timings (Seconds) and Parallel Performance for Porphine in the 6-31G** Basis (430 Basis Functions).

N_proc   Least squares   Speedup   Matrix assembly   Speedup   SCF       Speedup   Overall    Speedup
1        150.4           1.00      2138.6            1.00      2220.5    1.00      2431.9     1.00
2        76.4            1.97      1091.7            1.96      1165.8    1.90      1297.6     1.87
4        44.6            3.37      570.5             3.75      640.0     3.47      738.2      3.29
8        36.8            4.09      308.8             6.93      383.1     5.80      476.3      5.11
16       29.8            5.05      199.7             10.71     281.5     7.89      371.9      6.54

TABLE II. Hartree-Fock Timings (Seconds) and Parallel Performance for BPH in the 6-31G** Basis (740 Basis Functions).

N_proc   Least squares   Speedup   Matrix assembly   Speedup   SCF        Speedup   Overall     Speedup
1        430.8           1.00      15,541.0          1.00      16,097.9   1.00      16,729.6    1.00
2        220.2           1.96      7829.3            1.98      8265.1     1.95      8687.4      1.93
4        140.1           3.07      4100.6            3.79      4485.4     3.59      4832.0      3.46
8        115.7           3.72      2179.1            7.13      2578.7     6.24      2898.5      5.77
16       105.5           4.08      1243.8            12.49     1651.0     9.75      1962.7      8.52

TABLE III. Hartree-Fock SCF Timings (Seconds) and Parallel Performance for Alanine Pentapeptide in the cc-pVTZ(-f) Basis (818 Basis Functions).

N_proc   Least squares   Speedup   Matrix assembly   Speedup   SCF        Speedup   Overall     Speedup
1        1807.7          1.00      28,379.6          1.00      29,071.5   1.00      31,147.4    1.00
2        922.0           1.96      14,214.0          2.00      14,824.3   1.96      16,025.2    1.94
4        508.0           3.56      7250.0            3.91      7722.3     3.76      8505.8      3.66
8        463.3           3.90      3764.0            7.54      4184.0     6.95      4925.6      6.32
16       438.6           4.12      2011.1            14.11     2447.3     11.88     3160.9      9.85


TABLE IV. Hartree-Fock Analytical Gradient Timings (Seconds) and Parallel Performance for Porphine in the 6-31G** Basis.

N_proc   One-electron gradients (serial)   Two-electron gradients   Gradient speedup   Full geometry iteration   Overall speedup
1        16.9                              1845.2                   1.00               6137.5                    1.00
2        16.8                              918.1                    1.99               3186.5                    1.93
4        16.9                              477.1                    3.76               1716.4                    3.58
8        16.9                              260.5                    6.71               1037.0                    5.92
16       16.8                              137.3                    11.97              780.0                     7.87

TABLE V. Hartree-Fock Analytical Gradient Timings (Seconds) and Parallel Performance for BPH in the cc-pVTZ(-f) Basis.

N_proc   One-electron gradients (serial)   Two-electron gradients   Gradient speedup   Full geometry iteration   Overall speedup
1        51.3                              6156.6                   1.00               26,898.5                  1.00
2        51.3                              3117.9                   1.96               13,955.8                  1.93
4        51.6                              1627.5                   3.70               7692.7                    3.50
8        51.5                              872.6                    6.72               4670.8                    5.76

FIGURE 2. Parallel speedups for porphine in the 6-31G** basis.

This is entirely consistent with the performance model put forth earlier. At 94% efficiency, the individual Fock build times for porphine on four processors range from 36 to 145 seconds, depending on the grid used and whether or not analytic corrections are employed for that iteration. The average Fock build time for this case is 63 seconds. For BPH on eight processors, Fock build times range from 70 to 284 seconds and average 167 seconds, at 89% efficiency. The individual Fock build times for alanine pentapeptide have not been reduced to absolute times that are as low while retaining 90% efficiency. At 88% efficiency (16 processors), however, absolute times range from 187 to 322 seconds and average 222 seconds. The major contribution to inefficiencies in the Fock matrix assembly is from poor load balancing. For porphine on four processors, perfect load balance would have saved approximately 15.9 seconds of wall-clock time and yielded a Fock matrix assembly speedup of 3.86. For 8 and 16 processors, perfect load balance would yield speedups of approximately 7.5 and 14.0. The main reason for poor load balance here is an imbalance in the analytic correction times. This could probably be masked by increasing the number of dynamically assigned grid blocks.

FIGURE 3. Parallel speedups for BPH in the 6-31G** basis.

FIGURE 4. Parallel speedups for alanine pentapeptide in the cc-pVTZ(-f) basis.

After load balancing in the Fock matrix build, we find that the greatest single hindrance to our overall performance results from the parallel eigenvector refinement. The speedups for porphine eigenvector refinement are summarized in Figure 7. A total of 27.4 seconds was spent in diagonalization (9.3 seconds in the first-iteration full diagonalization, 18.1 seconds in the remaining seven Lanczos eigenvector refinements) for the single-processor case. For four nodes, total wall-clock time was 17.8 seconds, and total speedup was 1.5. For 8 and 16 nodes, total times of 15.6 and 16.5 seconds were observed, respectively. We can compare the performance for the first-iteration diagonalization (done with ScaLAPACK) with the performance in the Lanczos iterations to get an idea of how much of an improvement is achieved by using the parallel Lanczos algorithm. If all iterations used a parallel diagonalization instead of the Lanczos algorithm, we would expect a total time of 8 × 9.3 = 74.4 seconds for eigenvector refinement. For 4, 8, and 16 processors, total time would be approximately 74, 62, and 68 seconds, respectively. Thus, we see that we gain a factor of three or four in wall-clock time with the Lanczos algorithm. Although the speedups for both ScaLAPACK and our parallel Lanczos implementation are rather poor, the Lanczos speedups are higher for each case. More importantly, the serial diagonalization times in the Lanczos method are much lower.

FIGURE 5. Parallel speedups for porphine gradient calculations in the 6-31G** basis.

FIGURE 6. Parallel speedups for BPH gradient calculations in the 6-31G** basis.

FIGURE 7. Eigenvector refinement speedups for the porphine 6-31G** calculation.

Outside of the Fock matrix build and eigenvector refinement, the remaining parallelizable problem results from N_basis × N_basis matrix multiplies. With the parallelization of these matrix multiplies, all major portions of the code that can be effectively parallelized have been. Some of the portions that remain serial include the DIIS convergence scheme, I/O on replicated data, and calculation of basis function pseudo-overlaps.

It can be seen from the data in Tables I-V that we have accomplished our stated goal of achieving low wall-clock time while maintaining high efficiency in comparison to one-node calculations. If we examine only cases where approximately 90% efficiency is maintained, we see that all Hartree-Fock calculations examined here are completed in less than 2.5 hours. We also see that the number of nodes that can be efficiently employed increases as the problem size increases. For porphine, with 430 basis functions, it is only possible to use two nodes at 94% efficiency, achieving a total wall-clock time of 22 minutes. Increasing the problem size to 740 basis functions in BPH allows four nodes to be utilized at 87% efficiency with a total wall-clock time of 1 hour and 21 minutes. And, again increasing the problem size to 818 basis functions in alanine pentapeptide allows four nodes to be utilized at 92% efficiency with a total time of 2 hours and 22 minutes.

In our gradient calculations we achieve a higher efficiency than in the corresponding Hartree-Fock calculation. For porphine, four nodes can be utilized efficiently for the gradient calculation, whereas only two nodes can be used in the Hartree-Fock calculation. The total efficiency (including the least squares, SCF, and one-electron portions of the program) at four nodes for a single geometry iteration is 89%. For geometry optimization problems where the result is urgently desired, one might choose to accept lower efficiency in return for faster turnaround. In the case of porphine, running on eight nodes leads to 74% overall efficiency.

Conclusion

In this article, we have presented our parallel algorithm for pseudospectral electronic structure calculations and proven its viability by showing that the ratio of computation to communication is much larger than 1 for each of the tasks that we have successfully parallelized. In addition, we have provided timing data for molecules having 430, 740, and 818 basis functions. Further, we have detailed results for our parallel gradient code and provided timings for a system with 430 basis functions. Despite its deficiencies, our code can currently carry out a gas-phase geometry optimization cycle (i.e., a Hartree-Fock calculation, HF-gradient calculation, and geometry step) on porphine (430 basis functions) in roughly 20 minutes on an eight-processor SP2 partition.

In applications of electronic structure calculations, work is routinely carried out for a diverse mix of chemical problems, each of which requires investigation of a substantial number of molecules. Thus, overall throughput is of comparable importance to turnaround time for an individual job, and one has to strike a balance between these two objectives. As noted earlier, the pseudospectral method is inherently more efficient than corresponding analytical approaches, and it is therefore probable that parallelization of an analytical Hartree-Fock code on the IBM SP2 would efficiently use more nodes than we have. However, greater parallel scalability at the expense of much greater computational cost is pointless. Our design strategy has been to preserve the high efficiency of our single-node performance, and to carry out parallelization within this constraint.

Parallelization of the Hartree-Fock functionality is only the beginning step in the development of parallel pseudospectral algorithms. It is, however, the most critical one, because the algorithms for electron correlation (GVB, MP2, GVB-LMP2, DFT) all utilize the core Hartree-Fock computational functions, such as assembly of Coulomb and exchange operators. Our local MP2 algorithm has already been parallelized, as is reported in part II of this series.[25] Efficiencies that are superior to those reported here are obtained, as many of the small auxiliary routines used for SCF convergence are not relevant to an LMP2 calculation. We expect that similar efficiencies will be obtained for the additional correlation methods just listed.

Supplementary Material

The structure files for each of the molecules used in timing runs in this article are available in XMol's xyz format.


References

1. Jaguar v3.0, Schrödinger, Inc., Portland, OR, 1997.
2. M. Feyereisen and R. A. Kendall, Theor. Chim. Acta, 84, 289 (1993).
3. H. P. Lüthi and J. Almlöf, Theor. Chim. Acta, 84, 443 (1993).
4. S. Vogel, J. Hutter, T. H. Fischer, and H. P. Lüthi, Int. J. Quantum Chem., 45, 665 (1993).
5. S. Brode, H. Horn, M. Ehrig, D. Moldrup, J. E. Rice, and R. Ahlrichs, J. Comput. Chem., 14, 1142 (1993).
6. D. Bernholdt, E. Aprà, H. Früchtl, M. Guest, R. Harrison, R. Kendall, R. Kutteh, X. Long, J. Nicholas, J. Nichols, H. Taylor, A. Wong, G. Fann, R. Littlefield, and J. Nieplocha, Int. J. Quantum Chem. Quantum Chem. Symp., 29, 475 (1995).
7. M. Guest, P. Sherwood, and J. van Lenthe, Theor. Chim. Acta, 84, 423 (1993).
8. I. T. Foster, J. L. Tilson, A. F. Wagner, R. L. Shepard, R. J. Harrison, R. A. Kendall, and R. J. Littlefield, J. Comput. Chem., 17, 109 (1996).
9. R. J. Harrison, M. F. Guest, R. A. Kendall, D. E. Bernholdt, A. T. Wong, M. Stave, J. L. Anchell, A. C. Hess, R. J. Littlefield, G. L. Fann, J. Nieplocha, G. S. Thomas, D. Elwood, J. L. Tilson, R. L. Shepard, A. F. Wagner, I. T. Foster, E. Lusk, and R. Stevens, J. Comput. Chem., 17, 124 (1996).
10. B. H. Greeley, T. V. Russo, D. T. Mainz, R. A. Friesner, J.-M. Langlois, W. A. Goddard III, R. E. Donnelly Jr., and M. N. Ringnalda, J. Chem. Phys., 101, 4028 (1994).
11. R. Shepard, Theor. Chim. Acta, 84, 343 (1993).
12. W. T. Pollard and R. A. Friesner, J. Chem. Phys., 99, 6742 (1993).
13. L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK User's Guide, Society for Industrial and Applied Mathematics, Philadelphia, 1997.
14. R. A. Friesner, Chem. Phys. Lett., 116, 39 (1985).
15. R. A. Friesner, J. Chem. Phys., 85, 1462 (1986).
16. R. A. Friesner, J. Chem. Phys., 86, 3522 (1987).
17. R. A. Friesner, J. Phys. Chem., 92, 3091 (1988).
18. M. N. Ringnalda, Y. Won, and R. A. Friesner, J. Chem. Phys., 92, 1163 (1990).
19. J. M. Langlois, R. P. Muller, T. R. Coley, W. A. Goddard III, M. N. Ringnalda, Y. Won, and R. A. Friesner, J. Chem. Phys., 92, 7488 (1990).
20. M. N. Ringnalda, M. Belhadj, and R. A. Friesner, J. Chem. Phys., 93, 3397 (1990).
21. P. Gill, M. Head-Gordon, and J. Pople, J. Chem. Phys., 94, 5564 (1990).
22. P. Gill, M. Head-Gordon, and J. Pople, Int. J. Quantum Chem., S23, 269 (1989).
23. H. P. Lüthi, J. E. Mertz, M. W. Feyereisen, and J. E. Almlöf, J. Comput. Chem., 13, 160 (1992).
24. Y. Won, J.-G. Lee, M. N. Ringnalda, and R. A. Friesner, J. Chem. Phys., 94, 8152 (1991).
25. M. D. Beachy, D. Chasman, R. A. Friesner, and R. B. Murphy, J. Comput. Chem., 19, 1030 (1998).
