Calculating Electron Repulsion Integrals
on GPU and CPU Architectures
Henry Lambert
January 7, 2010
Contents

1 Abstract
2 Introduction
3 Physical Considerations
  3.0.1 Electron-Electron Repulsion Algorithm
  3.0.2 Gaussian Functions
  3.0.3 McMurchie-Davidson
4 Computational Considerations
  4.1 GPU Hardware
5 Method: ERI Implementation on GPU
6 Results and Discussion
7 Conclusion
8 Acknowledgement
A NVIDIA Hardware Specifications
B GPU Kernel Code
1 Abstract
Large demand for high-quality Graphical Processing Units (GPUs) for applications in video
games and home entertainment systems has led to the introduction of machines with enormous
computing potential at a relatively low price. The NVIDIA Corporation has provided a
C-like extension language called Compute Unified Device Architecture[9] to allow scientific
programmers to exploit the processing power of the GPU for intensive numerical computations.
Computational chemistry is a field full of numerically intensive computing tasks; in
particular, the calculation of the coulombic electron repulsion integrals (ERIs) in molecular
systems scales as O(N^4) with the number of electrons in a system. This rapid increase in
computational demands with system size represents a significant bottleneck in all ab initio
molecular modeling programs. Building on the McMurchie-Davidson algorithm[2] (which
describes a method for the computation of ERIs), this project attempted to implement the
ERI section of the existing EXCITON software package on a GPU using a method devised
by Ufimtsev and Martinez [3], who have used GPUs to speed up the calculation of ERIs
for non-trivial chemical systems by factors of up to 140. This project's GPU implementation
of ERIs in the EXCITON software package was found to run at speeds comparable to the
CPU implementation for the smallest atomic systems, such as He and Be, and to run faster
by a factor of 2 for the larger neon atom. As the code is extended to larger, many-atom
systems, this performance advantage is expected to grow.
2 Introduction
The demand for Graphical Processing Units (GPUs) in home entertainment systems has led
to the introduction of machines with enormous computing potential at a low price. The
computing potential and availability of GPUs have made them good candidates to increase
the speed and efficiency of high performance computers for applications in computational
sciences. The use of GPUs could allow users to increase computational performance by
orders of magnitude over traditional multi-core architectures while still using a high level
language with distinct similarities to C [20]. Figure 1 demonstrates the scale of performance
increase of GPUs compared to traditional Central Processing Units (CPUs) in the last few
years.
Figure 1: Relative performance in GFlops (giga floating-point operations per second) of different GPU models vs. CPUs in linear algebra applications at peak performance over the last six years.
This wide divergence in performance shows that, for parallel computing applications, CPU
speeds are being soundly outpaced by those of GPUs. Not only are GPUs powerful
computing machines but they are relatively inexpensive. These graphics cards have a large
number of processing cores (the element of the chip which executes instructions) per device
and a distributed shared memory model: both of which are suited to doing intensive, parallel,
numerical computations, as will be discussed in more detail subsequently. The NVIDIA Corporation is one of the leaders in graphics card production and is interested in fostering a new
market for professionals looking to exploit their technology in diverse fields: computational
finance, fluid dynamics, and finite element analysis. NVIDIA has provided a C-extension
language called CUDA[10] (Compute Unified Device Architecture) in order to facilitate port-
ing code from CPUs onto the GPUs to exploit these highly parallel systems in search of
massive speed-ups. This project sought to port the section of an existing code package
called EXCITON1 concerned with the calculation of Electron Repulsion Integrals (ERIs) to the
GPU. EXCITON is a code for calculating the influence of many-body effects in molecular systems.
Incorporating some GW code [14] and considerations introduced by the Bethe-Salpeter
equations, the EXCITON software package is meant to improve upon traditional mean-field
theory by considering the motion of electron and hole bound states and their interaction with
plasmons and phonons in a solid. Any speed-up in the calculation of two-electron repulsion
integrals obtained from this port to the GPU will be benchmarked against the original CPU
implementation; difficulties and advantages of programming on the GPU will be discussed;
and possibilities for further implementation of traditional computational techniques in
solid-state chemistry on this novel architecture will be considered.
Quantum theory is capable of a full description of the dynamics of any molecular system.
However, solving the equations which govern the dynamic development of these molecular sys-
tems quickly becomes intractable. Density Functional Theory (DFT) [12] and Hartree-Fock
(HF) Theory are the two most popular frameworks for solving the many-electron Schrodinger
equation governing the dynamics of a molecular quantum system. An important step in both
DFT and HF is the calculation of the classical coulomb repulsion force between two electrons
in a molecular system. The number of these integrals to be evaluated scales as O(N4), where
N is the number of electrons in the interacting molecular system. This O(N4) scaling means
for even relatively small systems the number of integrals to be evaluated is very large. For
instance, a single benzene molecule, which has 42 electrons, requires the calculation of
388 962 electron repulsion integrals (42^4/8, once the eight-fold permutational symmetry of the integrals is exploited). That
holds for an isolated benzene molecule in a vacuum. For physically interesting systems, with
real technological applications, one may want to consider a system of graphene with hundreds
of six-carbon rings. A system of this size would push the number of two-electron repulsion
1The main developers of the EXCITON software package are Charles Patterson, Svetjlana Galamic-Mulaomerovic, and Conor Hogan. Contact: [email protected]
integrals to be evaluated well into the billions. Addressing this computational workload is
of real importance for describing the electronic properties of a system and requires novel
approaches including restructuring the traditional algorithm and experimenting with new
computing architectures.
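These counts are easy to reproduce. The sketch below is purely illustrative (plain Python, not part of any quantum chemistry package); n stands for the 42 electrons of the benzene example, and the factor of 8 is the standard permutational symmetry of the integrals, used here as an approximate divisor:

```python
def eri_count(n, use_symmetry=True):
    """Approximate number of two-electron repulsion integrals over n
    functions: n**4 naively, or roughly n**4 / 8 once the eight-fold
    permutational symmetry of the integrals is exploited."""
    return n**4 // 8 if use_symmetry else n**4

print(eri_count(42, use_symmetry=False))  # 3111696 integrals for benzene
print(eri_count(42))                      # 388962 after symmetry
```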
To model the electronic structure of a molecular system requires an initial ansatz to describe
the orbitals of the system's electrons. From this ansatz it is possible to iterate towards
a description of the molecular system’s ground state. S.F. Boys[1] was the first to suggest
using a linear combination of Gaussian type orbitals as the ansatz for the electronic charge
distribution. From the Gaussian orbitals it is possible to obtain explicit numerical descriptions
of the electronic orbitals in any molecular system; in Boys’ words, “to any required
degree of accuracy with the only bound being the labour of computation.” The advantages
of these Gaussian orbitals are that their integrals can be evaluated in a straightforward man-
ner and that they are convenient for modeling charge distributions for electronic orbitals of
any angular momentum on multiple nuclear centers. Gaussian-type orbitals and the
McMurchie-Davidson scheme have gained widespread adoption and have facilitated the numerical
modeling of chemical systems. These schemes have benefited enormously from the
exponential growth in computer processing power implied by Moore’s Law.
In 1965 Gordon Moore wrote an article[7] for Electronics Magazine in which he projected
that the number of transistors on a silicon chip at a given price point would double every
eighteen months. The number of transistors on a chip is intimately correlated with the
processing speed a computer is capable of achieving.
Moore’s law has held up to a remarkable degree of accuracy since its original statement.
Today, however, the semiconductor industry is approaching the physical limits on the number
of transistors which can be placed on a single silicon chip. Excessive heat generation due to
transistor density, and quantum mechanical effects such as electrons tunneling through the
few nanometers which separate transistors, present fundamental engineering difficulties. To
ensure that processing power continues to grow exponentially, new technologies and compute
architectures are needed to allow the next generation of processors to extend Moore's law.
A popular model for extending Moore’s law involves introducing graphical processing units
(GPUs) for computing applications in a GPU-CPU hybrid computing system. This hybrid
computing approach involves distributing work to the device architecture according to its
nature. Serial code should be run on the CPU and numerically intensive processes which
are capable of being parallelized should be run on the GPU. This is because the compute
architecture of GPUs is much better suited to highly parallelized applications than that of
a traditional CPU. The NVIDIA programming guide describes succinctly how the features
of GPUs that make them suitable for image rendering translate to highly parallel scientific
computing applications:
In 3D rendering, large sets of pixels and vertices are mapped to parallel threads.
Similarly, image and media processing applications such as post-processing of ren-
dered images, video encoding and decoding, image scaling, stereo vision, and pat-
tern recognition can map image blocks and pixels to parallel processing threads.
In fact, many algorithms outside the field of image rendering and processing are
accelerated by data-parallel processing, from general signal processing or physics
simulation to computational finance or computational biology. [10]
The GPU contains many more processing cores, a shared memory model with high on-device
bandwidth, and a highly multi-threaded structure for image processing; all these features
favour parallelized data processing.
Professor Saman Amarasinghe of MIT has suggested that the combination of CPU and
GPU processors into a single hybrid computing model represents a paradigm shift that mir-
rors previous transitions in computing technology. He cites the shift from assembly language
in the 1970s to high-level languages like C and Fortran, and the transition to object oriented
programming in the 1990s[19]. The first paradigm shift was meant to give the programmer
more freedom to consider abstract algorithms without the need to focus on detailed aspects of
the computer’s physical architecture. The shift to object oriented programming was meant
to facilitate the integration of huge bodies of code from large-scale software development
projects with a large number of widely distributed programmers. This transition to a hybrid
computing model will require the re-thinking of traditional algorithms to utilize the process-
ing power of the GPU and surpass the aforementioned engineering difficulties that hinder
the continuation of Moore’s law.
3 Physical Considerations
3.0.1 Electron-Electron Repulsion Algorithm
In order to predict and explain the electronic, magnetic, and structural properties of molec-
ular systems it is necessary to compute the quantum mechanical wave equations governing
their dynamics. In crystal systems there are multiple nuclear centers surrounded by varying
numbers of electrons. These sorts of systems are many-body systems. Numerous approaches
have been developed for studying many-body interactions but the problem remains exceed-
ingly difficult even for idealized situations. These many-body systems possess far too many
degrees of freedom to be reduced to a simple analytical system of equations. It therefore
becomes necessary to develop numerical approaches to the solutions of the equations and
the modeling of their behaviour. The most widely used approaches to studying many-body
molecular systems involve Hartree-Fock methods and Density Functional Theory. In both
cases the ground state energy of a system can only be determined by solving a large system
of non-linear integro-differential equations. A discussion of the Hartree-Fock methods and
certain considerations in DFT describes the physical context in which two-electron repulsion
integrals arise and what information can be gained by calculating them. This section is the
most mathematically intensive of the paper and introduces key notation. It begins with a
brief definition of the Slater Determinant, and then states the Hamiltonian used to describe
the dynamics in these molecular systems. The second half of this section focuses on the rep-
resentation of electronic wave functions by gaussian orbitals and the method of calculating
two electron coulomb repulsion integrals algorithmically. Understanding the physical context
in which ERIs arise and the algebraic and algorithmic steps to calculate them are important
for understanding how the computation is implemented on the GPU.
In Hartree-Fock theory, the ground state many-electron wave equation is constructed from
a Slater Determinant (equation 1) of single electron wave functions. Higher energy states
can then be constructed from linear combinations of these “detors”2. The properties of such
a determinant ensure that all the conditions for a system of interacting fermions are met;
for instance, it ensures that the wave function is anti-symmetric with respect to particle
exchange.
2S. F. Boys suggested calling these determinants of normalized, orthogonalized single-particle wave functions “detors” but the term never made it into general use
\Psi_0^{HF} = \frac{1}{\sqrt{N!}}
\begin{vmatrix}
\psi_1(r_1) & \psi_2(r_1) & \cdots \\
\psi_1(r_2) & \psi_2(r_2) & \cdots \\
\vdots & \vdots & \ddots
\end{vmatrix} \qquad (1)
In the Hartree-Fock scheme this ground state wave function Ψ_0^{HF} is then minimized by
finding the appropriate ψ_i. Starting from a trial solution the orbitals are systematically
modified until the system is in its lowest energy conformation. The inherent assumptions are
that the system is not relativistic and the nuclei are static.
The Schrodinger equation for the Hartree-Fock approach can be written [15]:
H^{HF}\,|\Psi_0\rangle = E^{HF}\,|\Psi_0\rangle \qquad (2)
where the Hamiltonian HHF is composed of the one electron terms:
h_i = \left(-\frac{\hbar^2}{2m}\nabla_i^{2} - \frac{Ze^2}{r_i}\right) \qquad (3)
the two electron coulomb repulsions integrals:
J_i(r) = \int \frac{\rho(r')}{|r - r'|}\, dr' \qquad (4)
which take the expectation value:
\langle J_{ij} \rangle = \int\!\!\int dr\, dr'\, \psi_i(r)\,\psi_i^{*}(r)\, \frac{e^2}{|r - r'|}\, \psi_j^{*}(r')\,\psi_j(r') \qquad (5)
and the two electron exchange integrals (a purely quantum mechanical interaction arising
in the expansion of the determinant):
K_i(r, r') = \int \frac{\psi_j^{*}(r')\,\psi_i(r)}{|r - r'|}\, dr' \qquad (6)
which have the expectation value:
\langle K_{ij} \rangle = \int\!\!\int dr\, dr'\, \psi_i^{*}(r)\,\psi_j(r)\, \frac{e^2}{|r - r'|}\, \psi_i(r')\,\psi_j^{*}(r') \qquad (7)
It is the coulomb J_{ij} and exchange K_{ij} terms which quickly proliferate, as will be seen in section
3.0.3. The K_{ij} terms require a different implementation on the GPU and are treated by
Ufimtsev and Martinez[4]. This paper is only concerned with the calculation of the classical
coulombic two-electron repulsion integrals.
Even though it introduces no new physics, it is worth looking at the representation of the
previously mentioned one-electron (eqn. 8) and two-electron (eqn. 9) terms arising in Hartree-Fock
theory in the notation of second quantization. It is this notation that will be used to
refer to the one- and two-electron terms in the following discussion and conclusion.[13]
Single electron terms:

O_1 = \sum_{ij} h_{ij}\, a_i^{\dagger} a_j \qquad (8)

Two electron terms:

O_2 = \frac{1}{2} \sum_{ijkl} g_{ijkl}\, a_i^{\dagger} a_k^{\dagger} a_j a_l \qquad (9)
3.0.2 Gaussian Functions
In an important paper, S. F. Boys [1] showed that to obtain the numerical electronic wave
functions in any atomic or molecular system, it is sufficient to model the system entirely
with Gaussian functions and the derivatives of these Gaussian functions. Gaussians are not
only sufficient for constructing a system of functions that accurately describe any molecular
system but also allow for the explicit evaluation of the necessary integrals that arise. Boys
concluded that Gaussian functions satisfactorily model the charge distribution of electrons
in crystal systems, to any degree of accuracy, and could be used, “with the molecular orbital
method, or localized bond method, or the generalized method of treating linear combinations
of many Slater Determinants by the variational procedure.” Gaussian functions also possess
certain mathematical properties which make them easy to manipulate.
A model Gaussian for a 1s orbital takes the form[13]:

\phi_{1s}^{GF}(\alpha, \mathbf{r}-\mathbf{R}_A) = \left(\frac{2\alpha}{\pi}\right)^{3/4} e^{-\alpha|\mathbf{r}-\mathbf{R}_A|^2} \qquad (10)
Where RA is the nuclear co-ordinate and r is the electron’s co-ordinate. Modeling the
charge distribution using a Gaussian differs in form from the analytical treatment of a 1s
electron. The analytic solution for a single electron wave function takes the form of a Slater-
type orbital:
\phi_{1s}^{SF}(\gamma, \mathbf{r}-\mathbf{R}_A) = \left(\frac{\gamma^3}{\pi}\right)^{1/2} e^{-\gamma|\mathbf{r}-\mathbf{R}_A|} \qquad (11)
The functional behaviour of a Gaussian differs from that of a Slater-type orbital: particularly
in the value of the derivative at r = 0, and in the faster fall-off of the Gaussian
function at large r. This difference in functional behaviour is mitigated partly by the ease
of computation for Gaussian functions and partly by taking linear combinations of multiple
Gaussian functions to better approximate the Slater function.
The ease of computation of Gaussian functions alluded to earlier comes from the fact that
the product of two Gaussian orbitals is again a Gaussian and that easily defined recursive
relations exist to calculate the higher order derivatives of Gaussian functions. The first
property of Gaussians mentioned allows one to reduce the product of two electronic charge
distributions on two different nuclear centers to a single charge distribution around a single
modified center.
\phi_{1s}^{GF}(\alpha, \mathbf{r}-\mathbf{R}_A)\,\phi_{1s}^{GF}(\beta, \mathbf{r}-\mathbf{R}_B) = K_{AB}\,\phi_{1s}^{GF}(p, \mathbf{r}-\mathbf{R}_P) \qquad (12)

K_{AB} = \left(\frac{2\alpha\beta}{(\alpha+\beta)\pi}\right)^{3/4} \exp\left(-\frac{\alpha\beta}{\alpha+\beta}\,|\mathbf{R}_A-\mathbf{R}_B|^2\right) \qquad (13)
Where p = α+β is the sum of the Gaussian exponents, and RP is the reduced co-ordinate.
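Equations 12-13 translate directly into code. The sketch below is illustrative Python, not EXCITON's implementation; the formula used for the reduced centre, R_P = (αR_A + βR_B)/p, anticipates equation 19:

```python
import math

def gaussian_product(alpha, R_A, beta, R_B):
    """Gaussian product theorem (equations 12-13): two 1s Gaussians on
    centres A and B combine into one Gaussian on a shifted centre P."""
    p = alpha + beta                                        # combined exponent
    R_P = [(alpha * a + beta * b) / p for a, b in zip(R_A, R_B)]  # reduced centre
    AB2 = sum((a - b) ** 2 for a, b in zip(R_A, R_B))       # |R_A - R_B|^2
    K_AB = (2.0 * alpha * beta / (p * math.pi)) ** 0.75 \
           * math.exp(-alpha * beta / p * AB2)
    return p, R_P, K_AB
```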
As stated previously, the functional behaviour of a Gaussian is different from the functional
behaviour of the analytic solution. To better imitate the functional behaviour of the ana-
lytically determined Slater function a linear combination of “primitive” Gaussian functions
(identical in form to a normal Gaussian) is used to form a contracted Gaussian function. The
number of Gaussian functions in this contraction is called the contraction length L. These
contracted Gaussian basis sets have been numerically optimized and a number of readily
obtained, standardized basis sets exist[21].
\phi_{\mu}^{CGF}(\mathbf{r}-\mathbf{R}_A) = \sum_{p=1}^{L} d_{p\mu}\,\phi_{p}^{GF}(\alpha_{p\mu}, \mathbf{r}-\mathbf{R}_A) \qquad (14)
The integrals for a contracted basis function are obtained just by summing up integrals
over the individual primitive functions using the appropriate contraction co-efficients. It is
important to note that the co-efficients of the individual primitives are not optimized independently:
only the pre-optimized contraction as a whole is used.
This sum over primitive integrals increases the computational workload and is addressed in
section 5.
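A minimal sketch of equation 14 in Python, using the commonly tabulated STO-3G contraction for a hydrogen 1s orbital (L = 3) as an example; the coefficients and exponents are standard literature values, not numbers taken from this report or from EXCITON:

```python
import math

# STO-3G contraction for hydrogen 1s: (contraction coefficient d_p, exponent alpha_p)
STO3G_H = [(0.15432897, 3.42525091),
           (0.53532814, 0.62391373),
           (0.44463454, 0.16885540)]

def primitive_1s(alpha, r2):
    """Normalized primitive 1s Gaussian (equation 10) at squared distance r2."""
    return (2.0 * alpha / math.pi) ** 0.75 * math.exp(-alpha * r2)

def contracted_1s(r2, contraction=STO3G_H):
    """Contracted Gaussian (equation 14): a fixed linear combination of
    primitives, summed with the pre-optimized contraction coefficients."""
    return sum(d * primitive_1s(alpha, r2) for d, alpha in contraction)
```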
3.0.3 McMurchie-Davidson
The McMurchie-Davidson scheme[2] for calculating two electron integrals over cartesian gaus-
sian functions is reproduced here. The original paper gives a method for computing these
integrals by using a few simple auxiliary functions which are in turn defined using differ-
ent recursion relations. Understanding certain aspects of this algorithm will make certain
choices that were made for transferring the algorithm to the GPU clearer. It will also help
show where performance limitations arise from the tension between algorithmic demands and the
physical capabilities of the GPU. For instance, for higher angular momentum functions the
McMurchie-Davidson scheme requires the temporary storage of so many double precision
numbers that the arrays will exceed the capacity of shared memory (discussed in section
5) causing a performance degradation. The important thing to take away from this section
is that all two-electron repulsion integrals, between orbitals of any angular momentum, are
proportional to derivatives, with respect to the x, y, and z co-ordinates, of the incomplete gamma function.
Start from a generalized Gaussian basis function describing an electron’s charge distribu-
tion on nuclear center A:
\phi(n, l, m, \alpha_A, A) = x_A^{n}\, y_A^{l}\, z_A^{m}\, e^{-\alpha_A r_A^{2}} \qquad (15)
and then incorporate another gaussian basis function describing an electron’s charge dis-
tribution on nuclear center B. The charge distribution of the two basis functions on the two
different nuclear centers A and B can be written:
\Omega_{ij} = \phi_i \phi_j = x_A^{n} x_B^{n'}\, y_A^{l} y_B^{l'}\, z_A^{m} z_B^{m'}\, \exp[-(\alpha_A r_A^{2} + \alpha_B r_B^{2})] \qquad (16)
Using the relations from 3.0.2 this finally takes the form:
\exp[-(\alpha_A r_A^{2} + \alpha_B r_B^{2})] = E_{ij}\,\exp[-\alpha_p r_p^{2}] \qquad (17)
where
E_{ij} = \exp\left[-\frac{\alpha_A \alpha_B}{\alpha_A + \alpha_B}\,|A - B|^2\right]
The essence of the algorithm is that higher angular momentum wave functions are de-
scribed as simply the partial derivatives of an s-orbital wave function. The s-orbital wave
function is then just proportional to the incomplete gamma function. Hence following Boys,
McMurchie-Davidson writes the general two electron repulsion integral between electronic
orbitals of any angular momentum wave function as a series of partial derivatives of a zero
angular momentum Gaussian function:
[NLM|\tfrac{1}{r_{12}}|N'L'M'] = \left(\frac{\partial}{\partial P_x}\right)^{N} \left(\frac{\partial}{\partial P_y}\right)^{L} \left(\frac{\partial}{\partial P_z}\right)^{M} \left(\frac{\partial}{\partial Q_x}\right)^{N'} \left(\frac{\partial}{\partial Q_y}\right)^{L'} \left(\frac{\partial}{\partial Q_z}\right)^{M'} [000|\tfrac{1}{r_{12}}|000] \qquad (18)
where P and Q are the composite centers of the two charge distributions:

\mathbf{P} = \frac{\alpha A + \beta B}{\alpha + \beta} \qquad (19)

\mathbf{Q} = \frac{\gamma C + \delta D}{\gamma + \delta} \qquad (20)
To calculate equation 18 efficiently, McMurchie-Davidson defines a further auxiliary function:
R_{NLM} = \left(\frac{\partial}{\partial a}\right)^{N} \left(\frac{\partial}{\partial b}\right)^{L} \left(\frac{\partial}{\partial c}\right)^{M} \int_0^1 e^{-Tu^2}\, du \qquad (21)
where T (cf. equation 29) is defined:

T = \alpha(a^2 + b^2 + c^2) \qquad (22)
Using this auxiliary equation and a series of recursion relations defined by McMurchie-Davidson,
equation 18 can be computed in a straightforward manner. Since the GPU
does not support recursive function calls the recursion relations defined in McMurchie-
Davidson to calculate 21 had to be expanded as a series of nested for-loops to be executed
by the GPU.
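As a hypothetical illustration of that transformation (not the EXCITON kernel itself), the Hermite recurrence underlying equation 25 can be evaluated bottom-up with an ordinary loop, filling a small table from the base cases upward instead of making recursive calls:

```python
def hermite_iterative(j_max, x):
    """Evaluate Hermite polynomials H_0..H_{j_max} at x with a bottom-up
    loop -- the same trick used to replace recursion on the GPU: fill a
    table from the base cases instead of calling a function recursively."""
    H = [0.0] * (j_max + 1)
    H[0] = 1.0                        # H_0(x) = 1
    if j_max >= 1:
        H[1] = 2.0 * x                # H_1(x) = 2x
    for j in range(1, j_max):
        # physicists' recurrence: H_{j+1} = 2x H_j - 2j H_{j-1}
        H[j + 1] = 2.0 * x * H[j] - 2.0 * j * H[j - 1]
    return H
```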
The pre-factors which are computed recursively on the CPU are stated in equations 24-26
and are calculated using recursion relations similar to those of Hermite polynomials. Only the
results for the x axis are reproduced here but the arguments are easily extended for the y
and z co-ordinates:
x_A \Lambda_N(x_p; \alpha_p) = N\,\Lambda_{N-1} + (\overline{PA})_x\,\Lambda_N + \frac{1}{2\alpha_p}\,\Lambda_{N+1} \qquad (23)

\Lambda_j(x_p; \alpha_p)\, e^{-\alpha_p x_p^{2}} = \left(\frac{\partial}{\partial P_x}\right)^{j} e^{-\alpha_p x_p^{2}} \qquad (24)

\Lambda_j(x_p; \alpha_p) = \alpha_p^{j/2}\, H_j(\alpha_p^{1/2} x_p) \qquad (25)

x_A^{n} x_B^{n'} = \sum_{N=0}^{n+n'} d_N^{nn'}\, \Lambda_N(x_p; \alpha_p) \qquad (26)
H_j(\alpha_p^{1/2} x_p) are Hermite polynomials defined in the usual way.
The core integral in equation 18 was calculated by Boys to be:
[000|r_{12}^{-1}|000] = \lambda F_0(T) \qquad (27)

where F_0 is the incomplete gamma function, and \lambda and T are terms that will be
pre-calculated on the CPU as discussed in section 5:

\lambda = \frac{2\pi^{5/2}}{\alpha_P \alpha_Q \sqrt{\alpha_P + \alpha_Q}} \qquad (28)

T = \frac{\alpha_P \alpha_Q}{\alpha_P + \alpha_Q}\,|P - Q|^2 \qquad (29)
Besides describing these efficient recursive algorithms for calculating different components
of the ERIs, the McMurchie-Davidson scheme also details a method for evaluating the incom-
plete Gamma function to the required numerical accuracy through the use of an interpolating
table. This procedure resides as a device-side function.
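For reference, equation 21 with N = L = M = 0 reduces to the integral ∫₀¹ e^{-Tu²} du, which is F₀(T), and this has a standard closed form in terms of the error function. The sketch below uses that closed form and merely stands in for the interpolation table the kernel would actually use:

```python
import math

def boys_f0(T):
    """F_0(T) = integral of exp(-T u^2) for u in [0, 1].
    Closed form: (1/2) * sqrt(pi/T) * erf(sqrt(T)), with F_0(0) = 1."""
    if T < 1e-12:                     # small-T limit: F_0(T) -> 1 - T/3 + ...
        return 1.0 - T / 3.0
    return 0.5 * math.sqrt(math.pi / T) * math.erf(math.sqrt(T))
```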
4 Computational Considerations
4.1 GPU Hardware
The advantage of a GPU over a CPU in speed stems from the physical layout of the GPU
device: specifically the way memory is distributed on the chips in the GPU, and the GPU’s
multiple processing cores, which can operate in parallel. The processing cores are the part of
the computer which execute the instructions defined by the programmer. The advantage of
the GPU for computing comes from its threading resources. The NVIDIA Programming guide
explains the relative advantage of the GPU model, “Execution pipelines (CPU cores) on host
systems can support a limited number of concurrent threads. Servers that have four quad-core
processors today can run only 16 threads in parallel ... By comparison, the smallest executable
unit of parallelism on a GPU device, called a warp, comprises 32 threads. All NVIDIA GPUs
can support 768 active threads per multiprocessor, and some GPUs support 1,024 active
threads per multiprocessor. On devices that have 30 multiprocessors this leads to more than
30,000 active threads.” [9] The maximum theoretical limit on the number of threads which
can be scheduled in a single kernel launch on a Tesla device is 2 198 956 147 200. The actual implementation of the
ERI algorithm on the GPU will be discussed in section 5. This section will describe the basic
considerations of general parallel computing and the different layouts of the GPU architecture
and the traditional CPU architecture.
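The oddly specific thread figure quoted above can be reproduced from the launch limits of the compute-capability-1.x generation (an assumption about the hardware: a 65,535 × 65,535 grid of blocks and at most 512 threads per block):

```python
# Tesla-era (compute capability 1.x) kernel-launch limits, assumed here:
MAX_GRID_DIM = 65_535          # blocks per grid dimension (x and y)
MAX_THREADS_PER_BLOCK = 512    # threads per block

max_threads = MAX_GRID_DIM * MAX_GRID_DIM * MAX_THREADS_PER_BLOCK
print(max_threads)  # 2198956147200
```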
To determine whether or not a computation will benefit from being transported to a
parallel architecture, Amdahl’s law gives an approximation to the expected benefits of par-
allelization. A modern statement of Amdahl’s law is given by equation 30 which states that
if a fraction f of a computation is sped up by a factor of S, the speed up for the entire code
will be[18]:
\mathrm{Speedup}_{\text{overall}}(f, S) = \frac{1}{(1 - f) + \frac{f}{S}} \qquad (30)
Graphically this law can be interpreted as in figure 2. Clearly the most important factor
is the proportion of the existing algorithm which can be parallelized.
Generating all the combinations of electronic repulsion integrals that need to be calculated
is a relatively small proportion of the calculation; for certain systems this can amount to 1
part in a thousand of the total execution time. This small serial component of the ERI
algorithm means the fraction of the computation that can be parallelized and sped up is very
large. This makes the denominator in equation 30 very small and hence the total potential
benefit from parallelizing the algorithm across the many core architecture of the GPU very
large. This large potential benefit of running the ERI algorithm across the large number of
processing cores available on the GPU is the motivation for this project.
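Equation 30 is easy to sketch in code; the numbers below are illustrative, not measurements from this project:

```python
def amdahl_speedup(f, S):
    """Overall speed-up (equation 30) when a fraction f of the run time
    is accelerated by a factor S; the serial remainder (1 - f) dominates."""
    return 1.0 / ((1.0 - f) + f / S)

# If 99.9% of the ERI work parallelizes and the GPU speeds that part
# up 100x, the whole code runs roughly 91x faster.
print(round(amdahl_speedup(0.999, 100.0), 1))  # 91.0
```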
The general paradigm for parallel programming breaks into two models: the master-slave
Figure 2: This figure demonstrates the relative performance enhancement of a computation with respect to the fraction of the algorithm which can be parallelized and divided between multiple processors.[22]
model and the peer model[9]. In the master-slave model, the master node runs a serial
program and then distributes the more computationally intensive aspects to its slave nodes.
In the peer model, a collection of nodes runs the same instructions simultaneously. The
GPU-CPU model in fact possesses both levels of parallelism. The first level is master-slave
parallelism where the CPU runs a serial program and then calls its ‘slave node’, the GPU,
to run the computationally intensive electron repulsion integral algorithm in SIMD fashion.
The GPU itself is the second level of parallelism. It is a peer model parallel processor with
division into different streaming multi-processors called blocks which are composed of an
ordered array of threads capable of running instructions in parallel.
David Kirk and Wen-mei Hwu of NVIDIA use the analogy of a peach to demonstrate the
appropriate application of GPUs within this paradigm (Figure 3). The hard peach pit is the
serial component of an algorithm, which the CPU can process rapidly using its
own high clock rates, memory architecture, and compiler techniques. The soft peach flesh
represents the data-parallel aspect of the algorithm, which can be divided across the 240
processor cores of the Tesla GPU for high-throughput computing.
Figure 4 is a comparative schematic of the GPU and CPU architectures. Each of the
figures in this section has a fairly detailed caption. Studying the schematic figures is the
easiest way to get a sense of the computational resources that the GPU provides and how
Figure 3: David Kirk and Wen-mei Hwu’s peach analogy. The hard pit is the serial component of thealgorithm and the flesh part is ripe for data parallel exploitation. Trying to bite into the serial componentis equivalent to porting intrinsically serial code onto the GPU: it will not be successful. The flesh part issuitable for GPU implementation. [6]
these resources influence programming decisions.
The type of data parallelism used by current NVIDIA GPUs is essentially Single Instruc-
tion Multiple Data (SIMD). NVIDIA prefers the term SIMT (Single Instruction,
Multiple Thread) to emphasize the highly threaded design of its GPU devices, but for the
purposes of this discussion the two models are essentially identical. SIMD means that the
same set of instructions is executed by many processing threads in parallel across the entirety
of a large data set. The set of instructions to be executed is defined in the kernel using NVIDIA's
CUDA programming framework. The NVIDIA programming guide states, “the kernel is the
set of instructions that is executed on the GPU. C for CUDA extends C by allowing the
programmer to define C functions, called kernels that, when called, are executed N times in
parallel by N different CUDA threads, as opposed to only once like regular C functions”[10].
The kernel function determines the tasks that each individual thread will perform. The
individual kernel-executing threads are organized into blocks. In the GPU programming
model each processing core loads a single self-contained computational unit called a block
and executes it from start to finish before loading the next block. For this project each
block calculates a single Electron Repulsion Integral. Because each core must hold a block's
entire working set, the number of threads per block is limited by GPU resources such as
register space (which stores temporary variables) and shared memory.
Inspection of Figures 4 and 5 shows the cache memory (fast access memory) is a single large
contiguous grouping, as are the Arithmetical Logical Units (which are transistors responsible
for the mathematical manipulations), and the instruction unit (labeled control). This device
topology is optimized for large serial computations: a small number of cores distributing data
in a linear fashion to nearby ALUs and accessing a large pool of nearby cache memory. The
GPU architecture has a very different layout from that of the CPU. The key consideration
in the design of the GPU is parallel algorithm execution. A graphics card for high-quality
video rendering needs to update the individual pixels on a screen many times per second;
to meet this requirement there are multiple control units, each with its own high-speed
cached memory and a large number of ALUs per processor. The technical specifications of
the GPU used in this project are included in Appendix A.
Figure 4: Comparison of CPU chip layout and GPU chip layout. The green ALUs (arithmetic logic units) are responsible for all the mathematical and logical manipulations of the data. As the schematic demonstrates, the GPU allocates a large number of ALUs to each control unit or streaming multiprocessor (see Figure 7), each with its own easily accessed shared memory. This layout allows the control unit to manipulate large quantities of data in parallel at high speed. The 'DRAM' in both cases is not located on chip, requires more clock cycles to access, and is called global memory.
The GPU memory schematic is included in Figure 6. For threads residing on the same
block there is a high degree of data-visibility; that is to say, each thread in the same block
can communicate rapidly. This high visibility is mediated through the shared memory and
registers, all located within the block. There are 16 KB of shared memory and 16,384
registers available on each block. If the required data does not fit into these confines, or the
sizes of certain arrays are not defined at compile time, the compiler will start allocating the
overflow to address space in local memory. In terms of the physical layout of the graphics
chip, local memory is equivalent to global memory, and accessing data stored in global
memory incurs a relatively high level of latency. A large part of the coding process should
therefore be concerned with designing the algorithm to avoid overflow into local
memory. The constant memory and texture memory spaces (fig. 6) are both cached
and provide alternatives to global/local memory storage.
Data stored within the cache memory space is more readily available. Cached memory
accesses take as long as global/local memory for the first access, but subsequent reads can
be as fast as on chip shared memory since the data has already been retrieved and remains
available in the cache. On-chip manipulations (those residing within register and shared
memory space) cost about 4-6 clock cycles per instruction; accessing data in local memory
requires 400-600 clock cycles. A two-order-of-magnitude performance degradation is generally
unacceptable, so the algorithm must be carefully arranged to store as much data on
chip as possible and to ensure that any data which must be read from or written to global
memory is coalesced. Coalesced accesses occur when the memory addresses touched by a
warp are sequentially ordered and grouped in multiples of 32 (the warp size).
Figure 7 captures the essence of the GPU programming model. The algorithm is par-
titioned into a kernel grid. The kernel is the function launched on the GPU. The kernel
is composed of a grid of blocks. The SIMD model mandates each block should be a self
contained set of threads which execute the kernel algorithm. The classic example is matrix
multiplication, where each block loads a square segment of the matrix into the on-chip shared
memory. After the square slabs of matrix elements have been loaded into the on-block
shared memory, each block loops over the rows and columns of its data subset. When each
block has finished its partial summation there is a final sum reduction between blocks,
by reading and writing to global memory, to generate the final product matrix. The GPU
code used for this project has been included in Appendix B and follows a similar algorithmic
route to the matrix multiplication: that is, work is partitioned between blocks and finally
written back to global memory. The attached code may give a better understanding of what
is meant by blocks and threads and how these organizing principles can be used to write
parallel code that is executed on the GPU.
5 Method: ERI Implementation on GPU
The typical implementation of the Electron Repulsion Integrals on a CPU is serial,
with four outer loops sequencing through all the unique combinations of electron orbitals
Figure 5: A more detailed depiction of the GPU device architecture, showing the passage of data and the instruction set from the host CPU to the GPU and into the global memory space. The figure also displays the partitioned layout of the GPU for fast processing of data and the two-way load/store communication between the on-chip GPU shared memory space and the off-chip, slow-access global memory space.
Figure 6: The GPU memory model, demonstrating the layout of a block partitioned into computational threads, each executing the same set of instructions, and the registers, shared memory, and global memory which each thread can access to read/write data. [6]
Figure 7: Relationship between streaming multiprocessors and blocks. Each of the Tesla's 30 streaming multiprocessors is composed of 8 scalar processing cores which can each execute a single block at a time. Each streaming multiprocessor can execute up to 32 threads of a block at once: this group is called a warp. The instructions are defined in the kernel and are the same for each thread, in keeping with the Single Instruction Multiple Data model of parallelism. As long as there are more blocks than streaming multiprocessors, performance will improve with an increased number of streaming multiprocessors.
represented as Gaussians, and four inner loops sequencing through all the shell primitive
Gaussians that compose the contracted Gaussian function. The GPU algorithm is significantly
different from the CPU algorithm. The following flow chart shows the sequence of
computations necessary to compute the ERIs on the GPU; it also highlights where control
passes from the CPU to the GPU:
On CPU:
1. Calculate pre-factors on CPU (equations 24 to 26).
2. Copy pre-factors onto GPU.
On GPU:
3. Each block loads the appropriate pre-factors into shared memory.
4. Each block executes the kernel to calculate a unique ERI (cf. 3.0.3).
5. Integrals are translated from Cartesian co-ordinates to spherical harmonics.
6. Calculated ERI values are transferred back to the CPU.
To port this computation to the GPU requires a scheme which maps the CPU computation
to the GPU in a way which effectively uses the GPU resources. The kernel runs a modified
McMurchie-Davidson algorithm which is just the classical McMurchie-Davidson algorithm
along with a scheme for dictating how to efficiently distribute the individual ERIs between
the different blocks on the GPU. The mapping used in this project is the same as that
used by Ufimtsev and Martinez[3]. They present three different schemes: the One-Thread
One-Contracted integral scheme, the One-Block One-Contracted integral scheme and the
One-Thread One-Primitive integral scheme. All of these different mapping techniques are
illustrated in figure 8.
The One-Block One-Contracted integral scheme has been implemented in this project and is
shown schematically in Figure 8. It has two advantages: first, it possesses a medium level of
parallel granularity, meaning the work is divided over a relatively large number of threads
and blocks; second, it is the most convenient scheme for porting the current design of the
existing EXCITON code to the GPU. A higher
grain of parallelization means there are more threads in active execution. The One-Block
One-Contracted integral mapping scheme essentially states that each block will calculate
the coulombic repulsion between up to four electronic charge distributions represented as
contracted Gaussians (Equation 14). The longer the contraction length, the better this
scheme is meant to perform: each thread calculates one of the primitive integrals, and a
final sum reduction on the block recombines the primitive contributions into the contracted
integral. If more primitive Gaussians make up the total contraction, more threads will be
actively executing on each block and a higher level of parallelism is achieved.
The suggested mapping procedure starts by laying out all N atomic orbitals ψi, defined as
Gaussian functions (equation 15), as a square matrix whose elements are the bra and ket
pairs ψiψj. This square matrix has edge length M = N(N + 1)/2. For example, the helium
atom with a basis set of N = 3 requires a grid size M² = 36. Due to the
(bra|ket) = (ket|bra) symmetry of the Coulomb repulsion integral, only twenty-one of these
integrals (the diagonal and upper-triangular part of the grid) are unique; the
Figure 8: Different mapping algorithms of electron repulsion integrals to the device[3]. The One-Block One-Contracted Integral scheme is shown top left. Each block on the kernel grid calculates a single contracted ERI, and each thread of the block calculates one of the primitive integrals in the contraction. At the end of the algorithm there is a sum reduction to reform the contracted integral. The number of idle threads in this scheme depends on the number of primitive integrals in the contraction.
remaining integrals below the diagonal are redundant. Because of the layout of the GPU, and
the intrinsic parallelism, there is minimal expected performance degradation from including
the redundant integrals. However, future implementations might seek a way to eliminate
the need to include redundant integrals so that all the processors are generating relevant
information.
The first part of the computation, which includes generating the pairs of two-electron
repulsion integrals to be calculated and the recursively calculated pre-factors (section 3.0.3),
takes a small fraction of the total time: roughly 1 part in 1000. These terms are therefore
generated on the CPU and transferred to the GPU, where each of the highly multi-threaded
processors performs further manipulations of the data (details are included in the flow chart) and the
final contractions for all the electron repulsion integrals. Once this computation is completed
the data is transferred back to the CPU and the rest of the EXCITON self-consistent field
program[11] can run.
6 Results and Discussion
The CUDA visual profiler[10] and predefined device functions allow the user to time different
segments of the computation and track the time allocated to memory transfers from host to
device, device to device, and device to host. It also allows the user to time the length of
a kernel call (the time required by the GPU to calculate the ERIs). MPI (Message Passing
Interface) contains a built-in library for timing different functions, which was used to time
the same parts of the computation in the CPU implementation. Figure 9 compares the time
for the GPU implementation of EXCITON vs. the CPU implementation of EXCITON.
Figure 9 displays the time in milliseconds to calculate all the electron repulsion integrals
for He 1s², Be 1s²2s², C 1s²2s²2p², and Ne 1s²2s²2p⁶ systems. Each system is modeled
using three Gaussian functions. When electron repulsion integrals which are zero by
symmetry are omitted, the first three atoms treated in the graph have a total of 21 unique
integrals to calculate, and Ne has a total of 31. For the smaller systems the CPU
implementation outperforms the GPU, likely because of the small numerical load of the
computation: a highly optimized compiler is able to fit all the data in the CPU cache, so the
computation proceeds rapidly and the GPU's multi-core advantage is
negligible. The neon system is the first to show a noticeable improvement, with a factor
of two performance increase.
In benchmarking the speed of different algorithms on different architectures there are a
number of considerations. While massive speed-ups have been obtained using GPUs, it is
important that the benchmarking is fair and that the amount of work that has gone into
the GPU implementation of the code is comparable to that of the CPU implementation.
In this case the comparison is fair: the EXCITON software package has been carefully
considered and optimized over the last few years, and the CPU code was built with the
latest gcc compiler (4.4.0) with the highest level of optimization flags enabled, ensuring the
compiler was doing everything possible to increase performance. The GPU code was also
compiled with the highest optimization flags.
As has been stated, GPUs are meant to outperform CPUs only for data-parallel, numerically
intensive computations. While the properties of small atomic systems can be calculated
in a data-parallel fashion, they are not numerically intensive and are unlikely to reveal the
computing power of the GPU. Larger multi-atom systems would better demonstrate
speed-ups; in the current trials only a very small fraction of the GPU resources is being used.
Figure 9: Comparison of the CPU ERI implementation vs. the GPU implementation of the EXCITON code. Displayed above are the times for calculating the electron repulsion integrals in four atomic systems. The improvement in timing on the GPU increases with the system size: for neon there is a factor of two performance increase with the GPU implementation. Speed-ups are expected to improve as larger systems are considered.
The small atomic systems so far considered run very quickly on a CPU. The fact that
any speed ups at all have been detected for small systems is promising. This suggests that
as the code is extended to do computations on larger many atom systems the relative speed
up will be even greater. The One-Block One-Contracted integral mapping scheme is also
meant to perform better for highly contracted Gaussian integrals, that is, Gaussians with a
large number of primitive integrals. At the time of writing the basis sets being used consist
of Gaussian functions with only a single primitive integral; when larger basis sets are used
there should be an additional improvement over the CPU.
Figure 10: Comparison of kernel execution time for the helium atom and the beryllium atom. Integrals_2e_kernel is the name of the process for calculating the ERIs. The memcopy functions are the memory transfers of the pre-factors onto the device and of the calculated integrals from the GPU back onto the CPU. The length of each bar indicates the amount of time required for that process. The graph demonstrates that while the actual computation of the integrals takes equivalent time on the GPU, the time to transfer all the necessary data onto the GPU is the limiting factor and exceeds that of the actual computation. The transfer time will become less significant with increased system size and overlapped memory transfers.
In Fig. 10 the CUDA visual profiler has been used to return information about the time
required for the different phases of the GPU algorithm. These phases, time to transfer data
onto and off of the device and the actual ERI algorithm itself, reveal some points of interest
about the One-Block One-Contracted integral mapping. Firstly Fig. 10 suggests that the
algorithm is bound by the number of Gaussian functions in the basis set rather than the
total number of electrons in the system. Beryllium has two more electrons than helium, and
because of the O(N⁴) scaling of the time to calculate all of the ERIs, this system
might be expected to take fractionally longer. However, since on the GPU all these integrals
are computed in parallel, there is no appreciable difference in the time it takes to calculate the
electron repulsion integrals for the two atomic systems. This performance advantage of
the GPU should continue all the way up to the number of processing cores (Figure
7) available on the GPU; in the case of the Tesla C1060 that number is 240. Extrapolating all
the way to this number suggests the maximum theoretical speed-up on the GPU over a single
core CPU is in the area of 200x. To test this would require full streaming multiprocessor
occupancy and large molecular systems.
Fig. 10 demonstrates one of the limitations of the proposed scheme. While the actual
calculation of the electron repulsion integrals is comparable or faster in the current GPU
implementation, transferring the necessary pre-factors into the GPU’s memory space and
then, when calculations are complete, back into the CPU’s memory space takes a prohibitive
amount of time. In fact, for the small atomic systems the time taken to transfer the
pre-factors onto the device is considerable relative to the actual calculation. This represents
a prohibitive bottleneck: the speed-up on the GPU does not compensate for the additional
time required to transfer data from the host CPU onto the GPU and back again. However,
this is only the case for small systems; the data transfer time will become proportionally
less significant as the number of electron repulsion integrals grows with the molecular
system under consideration. Furthermore, in recent NVIDIA releases of the CUDA toolkit
it is possible to write code which transfers data onto and off of the GPU while the GPU
is running calculations. With such overlapped (asynchronous) transfers, on large molecular
systems the transfer time could be completely hidden by calculating the electron repulsion
integrals on the device while simultaneously loading the pre-calculated factors for the next
batch of integrals into the device's global memory.
A second anticipated problem with the chosen scheme is that ERIs involving higher angular
momentum wave functions require much more time to run than those involving lower angular
momentum wave functions. This means that at any one time a number of processors will be
occupied with blocks that have finished their work cycle, idling while they wait for other
blocks, with higher angular momentum wave functions, to finish their computational cycle.
The next batch of integrals cannot be loaded until the entire first batch is finished, which
could cause some performance degradation.
7 Conclusion
As has been discussed, the unique memory model and the large number of processing cores
per GPU mean that the computing potential is available to make significant improvements in the
time requirements of different tasks. However, harnessing this power can be difficult. Devel-
opment using the CUDA programming language and the architecture of the GPU can prove
recalcitrant at times. Moreover, parallel programming introduces a number of complications
not present when writing serial code. For instance, a serial code will fail in a predictable and
consistent way, whereas parallel code may work nine times out of ten but fail on the tenth
run. Such failures are often due to the non-deterministic order in which different blocks are
executed. However, as the coding of the GPU implementation in this project progressed,
the GPU framework became much easier to understand and development made more
significant progress.
The field of CUDA and GPU programming is also very young. The Tesla C1060 device
used in this project is among the first NVIDIA GPUs to support double precision computation,
so there is relatively little established literature and reference material to aid a new programmer.
Even over the two months of the project, new and important software for debugging
applications and for visualizing how data is distributed in the GPU memory space became
available, greatly simplifying error correction. Nor is the CUDA programming language by
any means static: the developers at NVIDIA and individual groups are ensuring that
programming languages for compute applications on GPUs are constantly evolving. Python,
a natural and popular language for numerical computing, is being brought to GPU devices
under the name PyCUDA[8]. There is also a movement towards an open-source standardized
language for computing on GPUs, developed under the name Open Computing Language
(OpenCL), analogous to the OpenGL graphics language already in use for the GPU's native
graphical capabilities. OpenCL is less advanced than the CUDA framework at the current
time but is attracting a great deal of interest. The rapidly evolving nature of this field
means that even over a time scale of two months, advances are being made to ease the
porting of CPU code onto the GPU and to fully exploit the GPU's processing power.
As a final note on the directions opened up by GPU architectures: Ufimtsev and Martinez
have developed programs that can calculate the ERIs and the exchange integrals entirely
on the GPU, along with the rest of a self-consistent field algorithm[5]. This has allowed
them to do ab initio molecular dynamics simulations of meaningful many body molecular
systems over practical time scales. The ability to calculate ab initio molecular dynamics on
what is essentially a desktop computing machine is a non-trivial result. Using the
increasingly well optimized linear algebra packages available on the GPU[16] to carry out
the self-consistent field algorithms, together with code that exploits the GPU's native
ability to display graphics, could speed up how quickly the EXCITON software package
runs its own self-consistent field programs and generates Fermi surfaces and band structures
of materials. As users gain maturity and experience with GPUs, the compiler and debug
software improve, and the programming language becomes standardized, the processing
power of the GPU will become essential for applications in computational materials science,
providing fast data processing and sophisticated visualization techniques.
At the completion of the project a factor of two speed-up for calculating the Electron
Repulsion Integrals was achieved. This is a very preliminary result. There remains a
substantial amount of coding and optimization to be done, and a more conclusive result
would require the program to compute larger, more complex molecular systems with more
highly contracted basis sets. Obtaining the two-orders-of-magnitude performance increases
observed elsewhere[3] will rely on optimizing the existing GPU code and applying it to
larger molecular systems. With the increasing sophistication of GPU technology and
software, and the promising indications of speed-ups observed on the exceedingly small
atomic systems considered here, there is hope that with more time the GPU implementation
of EXCITON, in essentially the same form, will significantly outperform the CPU
implementation.
8 Acknowledgement
I’d like to acknowledge Dr. Charles Patterson for coding the EXCITON software package,
and for his useful advice and instruction.
A NVIDIA Hardware Specifications
Device 0: "Tesla C1060"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default
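The per-block limits above (512 threads, 16384 registers, 16384 bytes of shared memory) determine how many thread blocks a kernel of a given size can keep resident at once. A minimal plain-C sketch of that arithmetic, using the figures reported by deviceQuery; the per-thread register count and per-block shared-memory usage in the usage note are hypothetical example values, not measurements from EXCITON:

```c
#include <stdio.h>

/* Limits reported by deviceQuery for the Tesla C1060. */
#define REGS_LIMIT            16384   /* registers available per block      */
#define SHARED_LIMIT          16384   /* bytes of shared memory per block   */
#define MAX_THREADS_PER_BLOCK 512

/* Blocks of `threads` threads that fit within the register limit,
 * given an assumed per-thread register usage. */
static int blocks_by_registers(int threads, int regs_per_thread)
{
    return REGS_LIMIT / (threads * regs_per_thread);
}

/* Blocks that fit within the shared-memory limit,
 * given an assumed per-block shared-memory footprint in bytes. */
static int blocks_by_shared(int shared_bytes_per_block)
{
    return SHARED_LIMIT / shared_bytes_per_block;
}
```

For example, 256-thread blocks at an assumed 32 registers per thread and 8 KB of shared memory per block are limited to 2 resident blocks by either resource; whichever limit is smaller wins.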
Bandwidth Transfer
Tesla C1060
Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2959.5
Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2582.1
Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 73355.5
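The bandwidth figures above are simply transfer size divided by elapsed time, in the MB = 2^20 bytes convention used by bandwidthTest. A small plain-C sketch of the arithmetic; the elapsed times in the usage note are hypothetical, chosen only to reproduce numbers of the order measured above:

```c
/* Effective bandwidth in MB/s from bytes moved and elapsed seconds,
 * using MB = 1048576 bytes as bandwidthTest does. */
static double bandwidth_mbs(double bytes, double seconds)
{
    return (bytes / (1024.0 * 1024.0)) / seconds;
}
```

For the 33554432-byte (32 MB) transfers above, a hypothetical host-to-device time of about 10.8 ms gives roughly 2963 MB/s, close to the measured 2959.5 MB/s, while a device-to-device time of about 0.44 ms gives roughly 73000 MB/s. The on-device bandwidth is some 25 times the PCIe transfer rate, which is why the kernel keeps intermediates in device and shared memory rather than shuttling them back to the host.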
B GPU Kernel Code
//ONE BLOCK ONE CONTRACTED INTEGRAL SCHEME (handles up to P Orbitals)
#include <stdio.h>
#include <cutil.h>
#include "ATOM_SCF_kernel.h"
#include "USER_DATA.h"
#include "myconstants.h"
__device__ void f000m_device(int, double*, double, double, int, double*);
__global__ void integrals_2e_kernel(DEVICE *device, INTEGRAL_LIST* integral_list,
SHELL *shells, double *d_F, double *d_fgtuvfinal)
VECTOR_DOUBLE r_12;
__shared__ double *p_F_temp;
__shared__ double *p_F_sh;
__shared__ double F_temp[784];
__shared__ double F_sh[225];
__shared__ double c1fac[3][3][3];
__shared__ double s_sab[1];
__shared__ double s_scd[1];
__shared__ double s_pab[1];
__shared__ double s_pcd[1];
__shared__ double s_fac1[1];
__shared__ double s_pinv[1];
__shared__ double s_c1x[12];
__shared__ double s_c2x[12];
__shared__ double f[5][5][5][5];
__shared__ double en[1][55];
__shared__ int tpupvp2, tpupvpsign;
__shared__ int tuv[8][6][3];
__shared__ double s_fgtuv[125];
int i1, j1, k1, l1;
int i, j, k, l, n;
double seven = 7.0000000;
double five = 5.0000000;
int count = 0;
int counter1, counter2;
int n1, n2, n3, n4, n5, n6;
int n7, n8, n9, n10, n11, n12;
int dime1, dime2, dime3, dime4;
int lim_ij, lim_kl;
int slim_ij, slim_jk, slim_kl;
int sheli, shelj, shelk, shell;
int sheli1, shelj1, shelk1, shell1;
int nsheli, nshelj, nshelk, nshell;
int op, oppshift1, oppshift2, oppshift3, oppshift4;
int *p_i, *p_j, *p_k, *p_l, *p_i1, *p_j1, *p_k1, *p_l1;
int index, index_i, index_j, index_k, index_l, i4, j4, k4, l4;
int sheli_lim, shelj_lim, shelk_lim, shell_lim;
int t, u, v, m;
int tp, up, vp;
int tmax, umax, vmax;
int tpmax, upmax, vpmax;
double *p_rot1, *p_rot2, *p_rot3, *p_rot4;
double tmp;
double fac, fac0, fac1, fac2, fac3;
int bfposi, bfposj, bfposk, bfposl;
int bfposi1, bfposj1, bfposk1, bfposl1;
int gausposi, gausposj, gausposk, gausposl;
int imax, kmax, lmax, jmax;
int mm = 0;
bfposk1 = 0; bfposl1 = 0; bfposi1 = 0; bfposj1 = 0;
for (t = 0; t< 3; t++)
for (u = 0; u < 3; u++)
for (v = 0; v < 3; v ++)
c1fac[t][u][v] = 0;
for(i = 0; i < 5; i++)
for(j = 0; j < 5; j++)
for(n = 0; n < 5; n++)
for(t = 0; t < 5; t++)
f[i][j][n][t] = 0;
for (n = 0; n < 55; n ++) en[0][n] = 0.0;
if(blockIdx.y == 0 || blockIdx.y == 1 ||blockIdx.y == 2) index_i = 0;
if(blockIdx.y == 3 || blockIdx.y == 4) index_i = 1;
if(blockIdx.y == 1 || blockIdx.y == 3) index_j = 1;
if(blockIdx.y == 0) index_j = 0;
if(blockIdx.y == 2) index_j = 2;
if(blockIdx.y == 4) index_j = 2;
if(blockIdx.y == 5) {
index_i = 2;
index_j = 2;
}
if(blockIdx.x == 0) index_l =0;
if (blockIdx.x == 0 || blockIdx.x == 1 || blockIdx.x ==2) index_k = 0;
if (blockIdx.x == 3 || blockIdx.x == 4) index_k = 1;
if (blockIdx.x == 1 || blockIdx.x == 3) index_l = 1;
if (blockIdx.x == 2 || blockIdx.x == 4) index_l = 2;
if (blockIdx.x == 5) {
index_k = 2;
index_l = 2;
}
sheli = shells->type1_sh[index_i];
sheli1 = shells->type_sh[index_i];
imax = shells->imax_sh[index_i];
sheli_lim = sheli;
for (index = 0; index < index_i; index++) {
bfposj += shells->type1_sh[index_i - index - 1];
bfposj1 += shells->type_sh[index_i - index - 1];
gausposj += shells->ng_sh[index_i - index - 1];
}
shelj = shells->type1_sh[index_j];
shelj1 = shells->type_sh[index_j];
jmax = shells->imax_sh[index_j];
shelj_lim = shelj;
shelk = shells->type1_sh[index_k];
shelk1 = shells->type_sh[index_k];
kmax = shells->imax_sh[index_k];
shelk_lim = shelk;
shell = shells->type1_sh[index_l];
shell1 = shells->type_sh[index_l];
lmax = shells->imax_sh[index_l];
shell_lim = shell;
int counter;
for (counter = 0; counter < 1; counter++)
if ( (index_k == index_i && index_l < index_j) || (((imax + jmax) / 2 ) * 2 == imax + jmax) && (((kmax + lmax + 1) / 2 ) * 2 == kmax + lmax + 1) ||
(((imax + jmax + 1) / 2 ) * 2 == imax + jmax + 1) && (((kmax + lmax) / 2 ) * 2 == kmax + lmax)
) continue;
if(blockIdx.y > blockIdx.x) continue;
mm = imax + jmax + kmax + lmax;
r_12.comp1 = 1.0;
r_12.comp2 = 1.0;
r_12.comp3 = 1.0;
s_sab[0] = device->sab[blockIdx.y + threadIdx.x];
s_pab[0] = device->pab[blockIdx.y + threadIdx.x];
s_scd[0] = device->scd[blockIdx.x + threadIdx.y];
s_pcd[0] = device->pcd[blockIdx.x + threadIdx.y];
s_fac1[0] = s_sab[0] * s_scd[0];
s_pinv[0] = (1.0 / s_pab[0]) + (1.0 / s_pcd[0]);
f000m_device(25, &en[0][0], 0.0, (1.0/ (*s_pinv)), mm, d_F);
for (n = 0; n <= mm; n++)
f[0][0][0][n] = en[0][n];
for (n = 0; n <= mm; n++)
f[1][0][0][n] = r_12.comp1 * en[0][n + 1];
for (n = 0; n <= mm; n++)
f[0][1][0][n] = r_12.comp2 * en[0][n + 1];
for (n = 0; n <= mm; n ++)
f[0][0][1][n] = r_12.comp3 * en[0][n + 1];
for (n = 0; n <= mm - 1; n++)
f[1][1][0][n] = r_12.comp1 * f[0][1][0][n + 1];
for (n = 0; n <= mm - 1; n++)
f[1][0][1][n] = r_12.comp1 * f[0][0][1][n + 1];
for (n = 0; n <= mm - 1; n++)
f[0][1][1][n] = r_12.comp2 * f[0][0][1][n + 1];
for (n = 0; n <= mm - 1; n++)
f[2][0][0][n] = r_12.comp1 * f[1][0][0][n + 1] + f[0][0][0][n + 1];
for (n = 0; n <= mm - 1; n++)
f[0][2][0][n] = r_12.comp2 * f[0][1][0][n + 1] + f[0][0][0][n + 1];
for (n = 0; n <= mm - 1; n++)
f[0][0][2][n] = r_12.comp3 * f[0][0][1][n + 1] + f[0][0][0][n + 1];
for (n = 0; n <= mm - 2; n++)
f[2][1][0][n] = r_12.comp1 * f[1][1][0][n + 1] + f[0][1][0][n + 1];
for (n = 0; n <= mm - 2; n++)
f[2][0][1][n] = r_12.comp1 * f[1][0][1][n + 1] + f[0][0][1][n + 1];
for (n = 0; n <= mm - 2; n++)
f[1][2][0][n] = r_12.comp2 * f[1][1][0][n + 1] + f[1][0][0][n + 1];
for (n = 0; n <= mm - 2; n++)
f[0][2][1][n] = r_12.comp2 * f[0][1][1][n + 1] + f[0][0][1][n + 1];
for (n = 0; n <= mm - 2; n++)
f[0][1][2][n] = r_12.comp3 * f[0][1][1][n + 1] + f[0][1][0][n + 1];
for (n = 0; n <= mm - 2; n++)
f[1][0][2][n] = r_12.comp3 * f[1][0][1][n + 1] + f[1][0][0][n + 1];
for (n = 0; n <= mm - 2; n++)
f[1][1][1][n] = r_12.comp1 * f[0][1][1][n + 1];
for (n = 0; n <= mm - 2; n++)
f[3][0][0][n] = r_12.comp1 * f[2][0][0][n + 1] + two * f[1][0][0][n + 1];
for (n = 0; n <= mm - 2; n++)
f[0][3][0][n] = r_12.comp2 * f[0][2][0][n + 1] + two * f[0][1][0][n + 1];
for (n = 0; n <= mm - 2; n++)
f[0][0][3][n] = r_12.comp3 * f[0][0][2][n + 1] + two * f[0][0][1][n + 1];
for (n = 0; n <= mm - 3; n++)
f[2][1][1][n] = r_12.comp1 * f[1][1][1][n + 1] + f[0][1][1][n + 1];
for (n = 0; n <= mm - 3; n++)
f[1][2][1][n] = r_12.comp2 * f[1][1][1][n + 1] + f[1][0][1][n + 1];
for (n = 0; n <= mm - 3; n++)
f[1][1][2][n] = r_12.comp3 * f[1][1][1][n + 1] + f[1][1][0][n + 1];
for (n = 0; n <= mm - 3; n++)
f[2][2][0][n] = r_12.comp1 * f[1][2][0][n + 1] + f[0][2][0][n + 1];
for (n = 0; n <= mm - 3; n++)
f[2][0][2][n] = r_12.comp1 * f[1][0][2][n + 1] + f[0][0][2][n + 1];
for (n = 0; n <= mm - 3; n++)
f[0][2][2][n] = r_12.comp2 * f[0][1][2][n + 1] + f[0][0][2][n + 1];
for (n = 0; n <= mm - 3; n++)
f[3][1][0][n] = r_12.comp1 * f[2][1][0][n + 1] + two * f[1][1][0][n + 1];
for (n = 0; n <= mm - 3; n++)
f[3][0][1][n] = r_12.comp1 * f[2][0][1][n + 1] + two * f[1][0][1][n + 1];
for (n = 0; n <= mm - 3; n++)
f[0][3][1][n] = r_12.comp2 * f[0][2][1][n + 1] + two * f[0][1][1][n + 1];
for (n = 0; n <= mm - 3; n++)
f[1][3][0][n] = r_12.comp2 * f[1][2][0][n + 1] + two * f[1][1][0][n + 1];
for (n = 0; n <= mm - 3; n++)
f[1][0][3][n] = r_12.comp3 * f[1][0][2][n + 1] + two * f[1][0][1][n + 1];
for (n = 0; n <= mm - 3; n++)
f[0][1][3][n] = r_12.comp3 * f[0][1][2][n + 1] + two * f[0][1][1][n + 1];
for (n = 0; n <= mm - 3; n++)
f[4][0][0][n] = r_12.comp1 * f[3][0][0][n + 1] + three * f[2][0][0][n + 1];
for (n = 0; n <= mm - 3; n++)
f[0][4][0][n] = r_12.comp2 * f[0][3][0][n + 1] + three * f[0][2][0][n + 1];
for (n = 0; n <= mm - 3; n++)
f[0][0][4][n] = r_12.comp3 * f[0][0][3][n + 1] + three * f[0][0][2][n + 1];
for (n = 0; n <= mm - 4; n++)
f[2][2][1][n] = r_12.comp1 * f[1][2][1][n + 1] + f[0][2][1][n + 1];
for (n = 0; n <= mm - 4; n++)
f[2][1][2][n] = r_12.comp1 * f[1][1][2][n + 1] + f[0][1][2][n + 1];
for (n = 0; n <= mm - 4; n++)
f[1][2][2][n] = r_12.comp2 * f[1][1][2][n + 1] + f[1][0][2][n + 1];
for (n = 0; n <= mm - 4; n++)
f[3][2][0][n] = r_12.comp1 * f[2][2][0][n + 1] + two * f[1][2][0][n + 1];
for (n = 0; n <= mm - 4; n++)
f[3][0][2][n] = r_12.comp1 * f[2][0][2][n + 1] + two * f[1][0][2][n + 1];
for (n = 0; n <= mm - 4; n++)
f[0][3][2][n] = r_12.comp2 * f[0][2][2][n + 1] + two * f[0][1][2][n + 1];
for (n = 0; n <= mm - 4; n++)
f[2][3][0][n] = r_12.comp2 * f[2][2][0][n + 1] + two * f[2][1][0][n + 1];
for (n = 0; n <= mm - 4; n++)
f[2][0][3][n] = r_12.comp3 * f[2][0][2][n + 1] + two * f[2][0][1][n + 1];
for (n = 0; n <= mm - 4; n++)
f[0][2][3][n] = r_12.comp3 * f[0][2][2][n + 1] + two * f[0][2][1][n + 1];
for (n = 0; n <= mm - 4; n++)
f[4][1][0][n] = r_12.comp1 * f[3][1][0][n + 1] + three * f[2][1][0][n + 1];
for (n = 0; n <= mm - 4; n++)
f[4][0][1][n] = r_12.comp1 * f[3][0][1][n + 1] + three * f[2][0][1][n + 1];
for (n = 0; n <= mm - 4; n++)
f[0][4][1][n] = r_12.comp2 * f[0][3][1][n + 1] + three * f[0][2][1][n + 1];
for (n = 0; n <= mm - 4; n++)
f[1][4][0][n] = r_12.comp2 * f[1][3][0][n + 1] + three * f[1][2][0][n + 1];
for (n = 0; n <= mm - 4; n++)
f[0][1][4][n] = r_12.comp3 * f[0][1][3][n + 1] + three * f[0][1][2][n + 1];
for (n = 0; n <= mm - 4; n++)
f[1][0][4][n] = r_12.comp3 * f[1][0][3][n + 1] + three * f[1][0][2][n + 1];
for (n = 0; n <= mm - 4; n++)
f[5][0][0][n] = r_12.comp1 * f[4][0][0][n + 1] + four * f[3][0][0][n + 1];
for (n = 0; n <= mm - 4; n++)
f[0][5][0][n] = r_12.comp2 * f[0][4][0][n + 1] + four * f[0][3][0][n + 1];
for (n = 0; n <= mm - 4; n++)
f[0][0][5][n] = r_12.comp3 * f[0][0][4][n + 1] + four * f[0][0][3][n + 1];
for (n = 0; n <= mm - 5; n++)
f[2][2][2][n] = r_12.comp1 * f[1][2][2][n + 1] + f[0][2][2][n + 1];
for (n = 0; n <= mm - 5; n++)
f[3][2][1][n] = r_12.comp1 * f[2][2][1][n + 1] + two * f[1][2][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[3][1][2][n] = r_12.comp1 * f[2][1][2][n + 1] + two * f[1][1][2][n + 1];
for (n = 0; n <= mm - 5; n++)
f[1][3][2][n] = r_12.comp2 * f[1][2][2][n + 1] + two * f[1][1][2][n + 1];
for (n = 0; n <= mm - 5; n++)
f[2][3][1][n] = r_12.comp2 * f[2][2][1][n + 1] + two * f[2][1][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[2][1][3][n] = r_12.comp3 * f[2][1][2][n + 1] + two * f[2][1][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[1][2][3][n] = r_12.comp3 * f[1][2][2][n + 1] + two * f[1][2][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[4][1][1][n] = r_12.comp1 * f[3][1][1][n + 1] + three * f[2][1][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[1][4][1][n] = r_12.comp2 * f[1][3][1][n + 1] + three * f[1][2][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[1][1][4][n] = r_12.comp3 * f[1][1][3][n + 1] + three * f[1][1][2][n + 1];
for (n = 0; n <= mm - 5; n++)
f[5][1][0][n] = r_12.comp1 * f[4][1][0][n + 1] + four * f[3][1][0][n + 1];
for (n = 0; n <= mm - 5; n++)
f[5][0][1][n] = r_12.comp1 * f[4][0][1][n + 1] + four * f[3][0][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[1][5][0][n] = r_12.comp2 * f[1][4][0][n + 1] + four * f[1][3][0][n + 1];
for (n = 0; n <= mm - 5; n++)
f[0][5][1][n] = r_12.comp2 * f[0][4][1][n + 1] + four * f[0][3][1][n + 1];
for (n = 0; n <= mm - 5; n++)
f[0][1][5][n] = r_12.comp3 * f[0][1][4][n + 1] + four * f[0][1][3][n + 1];
for (n = 0; n <= mm - 5; n++)
f[1][0][5][n] = r_12.comp3 * f[1][0][4][n + 1] + four * f[1][0][3][n + 1];
for (n = 0; n <= mm - 5; n++)
f[6][0][0][n] = r_12.comp1 * f[5][0][0][n + 1] + five * f[4][0][0][n + 1];
for (n = 0; n <= mm - 5; n++)
f[0][6][0][n] = r_12.comp2 * f[0][5][0][n + 1] + five * f[0][4][0][n + 1];
for (n = 0; n <= mm - 5; n++)
f[0][0][6][n] = r_12.comp3 * f[0][0][5][n + 1] + five * f[0][0][4][n + 1];
for (n = 0; n <= mm - 6; n++)
f[3][2][2][n] = r_12.comp1 * f[2][2][2][n + 1] + two * f[1][2][2][n + 1];
for (n = 0; n <= mm - 6; n++)
f[2][3][2][n] = r_12.comp2 * f[2][2][2][n + 1] + two * f[2][1][2][n + 1];
for (n = 0; n <= mm - 6; n++)
f[2][2][3][n] = r_12.comp3 * f[2][2][2][n + 1] + two * f[2][2][1][n + 1];
for (n = 0; n <= mm - 6; n++)
f[3][3][1][n] = r_12.comp1 * f[2][3][1][n + 1] + two * f[1][3][1][n + 1];
for (n = 0; n <= mm - 6; n++)
f[3][1][3][n] = r_12.comp1 * f[2][1][3][n + 1] + two * f[1][1][3][n + 1];
for (n = 0; n <= mm - 6; n++)
f[1][3][3][n] = r_12.comp2 * f[1][2][3][n + 1] + two * f[1][1][3][n + 1];
for (n = 0; n <= mm - 6; n++)
f[4][3][0][n] = r_12.comp1 * f[3][3][0][n + 1] + three * f[2][3][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[4][0][3][n] = r_12.comp1 * f[3][0][3][n + 1] + three * f[2][0][3][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][4][3][n] = r_12.comp2 * f[0][3][3][n + 1] + three * f[0][2][3][n + 1];
for (n = 0; n <= mm - 6; n++)
f[3][4][0][n] = r_12.comp2 * f[3][3][0][n + 1] + three * f[3][2][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][3][4][n] = r_12.comp3 * f[0][3][3][n + 1] + three * f[0][3][2][n + 1];
for (n = 0; n <= mm - 6; n++)
f[3][0][4][n] = r_12.comp3 * f[3][0][3][n + 1] + three * f[3][0][2][n + 1];
for (n = 0; n <= mm - 6; n++)
f[5][2][0][n] = r_12.comp1 * f[4][2][0][n + 1] + four * f[3][2][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[5][0][2][n] = r_12.comp1 * f[4][0][2][n + 1] + four * f[3][0][2][n + 1];
for (n = 0; n <= mm - 6; n++)
f[2][5][0][n] = r_12.comp2 * f[2][4][0][n + 1] + four * f[2][3][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][5][2][n] = r_12.comp2 * f[0][4][2][n + 1] + four * f[0][3][2][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][2][5][n] = r_12.comp3 * f[0][2][4][n + 1] + four * f[0][2][3][n + 1];
for (n = 0; n <= mm - 6; n++)
f[2][0][5][n] = r_12.comp3 * f[2][0][4][n + 1] + four * f[2][0][3][n + 1];
for (n = 0; n <= mm - 6; n++)
f[6][1][0][n] = r_12.comp1 * f[5][1][0][n + 1] + five * f[4][1][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[6][0][1][n] = r_12.comp1 * f[5][0][1][n + 1] + five * f[4][0][1][n + 1];
for (n = 0; n <= mm - 6; n++)
f[1][6][0][n] = r_12.comp2 * f[1][5][0][n + 1] + five * f[1][4][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][6][1][n] = r_12.comp2 * f[0][5][1][n + 1] + five * f[0][4][1][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][1][6][n] = r_12.comp3 * f[0][1][5][n + 1] + five * f[0][1][4][n + 1];
for (n = 0; n <= mm - 6; n++)
f[1][0][6][n] = r_12.comp3 * f[1][0][5][n + 1] + five * f[1][0][4][n + 1];
for (n = 0; n <= mm - 6; n++)
f[7][0][0][n] = r_12.comp1 * f[6][0][0][n + 1] + six * f[5][0][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][7][0][n] = r_12.comp2 * f[0][6][0][n + 1] + six * f[0][5][0][n + 1];
for (n = 0; n <= mm - 6; n++)
f[0][0][7][n] = r_12.comp3 * f[0][0][6][n + 1] + six * f[0][0][5][n + 1];
for (i1 = 0; i1 <= mm; i1++)
for (j1 = 0; j1 <= mm; j1++)
for(k1 = 0; k1 <= mm; k1++)
s_fgtuv[i1 * ((mm+1)*(mm+1)) + j1 * (mm + 1) + k1] = s_fac1[0] * f[i1][j1][k1][0];
p_F_temp = F_temp;
for (i = 0; i < sheli * shelj * shelk * shell; i++) {
*p_F_temp = zero;
p_F_temp++;
}
dime4 = shell_lim ;
dime3 = dime4*shelk_lim ;
dime2 = dime3*shelj_lim ;
dime1 = dime2*sheli_lim ;
p_F_sh = F_sh;
for (i = 0; i < sheli1 * shelj1 * shelk1 * shell1; i++) {
*p_F_sh = zero;
p_F_sh++;
}
for (i = 0; i < 8; i++)
for (j = 0; j < 6; j++)
for (k = 0; k < 3; k++)
tuv[i][j][k] = 0;
for (i = 0; i < sheli_lim; i++) {
slim_ij = 0;
if (index_i == index_j) slim_ij = i;
for (j = slim_ij; j < shelj_lim; j++) {
tmax = tuv[sheli_lim][i][0] + tuv[shelj_lim][j][0];
umax = tuv[sheli_lim][i][1] + tuv[shelj_lim][j][1];
vmax = tuv[sheli_lim][i][2] + tuv[shelj_lim][j][2];
n1 = tuv[sheli_lim][i][0];
n2 = tuv[sheli_lim][i][1];
n3 = tuv[sheli_lim][i][2];
n4 = tuv[shelj_lim][j][0];
n5 = tuv[shelj_lim][j][1];
n6 = tuv[shelj_lim][j][2];
slim_jk = 0;
for (k = 0; k < shelk_lim; k++) {
slim_kl = 0;
if (index_k == index_l) slim_kl = k;
for (l = slim_kl; l < shell_lim; l++) {
if (index_i == index_k && index_j == index_l && i * shelj_lim + j > k * shell_lim + l)
continue;
tpmax = tuv[shelk_lim][k][0] + tuv[shell_lim][l][0];
upmax = tuv[shelk_lim][k][1] + tuv[shell_lim][l][1];
vpmax = tuv[shelk_lim][k][2] + tuv[shell_lim][l][2];
n7 = tuv[shelk_lim][k][0];
n8 = tuv[shelk_lim][k][1];
n9 = tuv[shelk_lim][k][2];
n10 = tuv[shell_lim][l][0];
n11 = tuv[shell_lim][l][1];
n12 = tuv[shell_lim][l][2];
int ijmax = (imax + jmax + 1);
int klmax = (kmax + lmax + 1);
counter1 = 0; counter2 = 0;
for (t = 0; t <= tmax; t++)
for (u = 0; u <= umax; u++)
for (v = 0; v <= vmax; v++) {
counter2 = 0;
for (tp = 0; tp <= tpmax; tp++)
for (up = 0; up <= upmax; up++)
for (vp = 0; vp <= vpmax; vp++) {
tpupvpsign = -one;
tpupvp2 = tp + up + vp + 2;
if ((tpupvp2 / 2) * 2 == tpupvp2)
tpupvpsign = one;
c1fac[t][u][v] += s_fgtuv[(t + tp)*((mm+1)*(mm+1)) + (u+up)*(mm+1) + (v+vp)] * s_c2x[tp * (lmax + 1)*(kmax + 1) + n7 * (lmax + 1) + n10]\
* s_c2x[up * (kmax + 1)*(lmax + 1) + n8 * (lmax + 1) + n11]\
* s_c2x[vp * (lmax + 1)*(kmax + 1) + n9 * (lmax + 1) + n12]\
* tpupvpsign;
} // end tp up vp loop
} // end t u v loop
for (t = 0; t <= tmax; t++)
for (u = 0; u <= umax; u++)
for (v = 0; v <= vmax; v++) {
p_F_temp = F_temp + dime2 * i + dime3 * j + dime4 * k + l;
*p_F_temp += c1fac[t][u][v] * s_c1x[t * (imax + 1)*(jmax + 1) + n1 * (jmax + 1) + n4]\
* s_c1x[u * (imax + 1)*(jmax + 1) + n2 * (jmax + 1) + n5]\
* s_c1x[v * (imax + 1)*(jmax + 1) + n3 * (jmax + 1) + n6];
}
}
}
}
} // end ijkl loop
nsheli = *(shells->num_ij + shells->ord_sh[index_i]);
nshelj = *(shells->num_ij + shells->ord_sh[index_j]);
nshelk = *(shells->num_ij + shells->ord_sh[index_k]);
nshell = *(shells->num_ij + shells->ord_sh[index_l]);
p_i1 = shells->ind_i + shells->opp_sh[index_i];
p_i = shells->ind_j + shells->opp_sh[index_i];
p_rot1 = shells->rot + shells->opp_sh[index_i];
for (i = 0; i < nsheli; i++) {
p_j1 = shells->ind_i + shells->opp_sh[index_j];
p_j = shells->ind_j + shells->opp_sh[index_j];
p_rot2 = shells->rot + shells->opp_sh[index_j];
for (j = 0; j < nshelj; j++) {
p_k1 = shells->ind_i + shells->opp_sh[index_k];
p_k = shells->ind_j + shells->opp_sh[index_k];
p_rot3 = shells->rot + shells->opp_sh[index_k];
for (k = 0; k < nshelk; k++) {
p_l1 = shells->ind_i + shells->opp_sh[index_l];
p_l = shells->ind_j + shells->opp_sh[index_l];
p_rot4 = shells->rot + shells->opp_sh[index_l];
for (l = 0; l < nshell; l++) {
i1 = *p_i1;
j1 = *p_j1;
k1 = *p_k1;
l1 = *p_l1;
fac2 = one;
if (index_i == index_j && *p_i > *p_j) fac2 = zero;
if (index_k == index_l && *p_k > *p_l) fac2 = zero;
if (index_i == index_k && index_j == index_l && *p_i * shelj1 + *p_j > *p_k * shell1 + *p_l) fac2 = zero;
if (index_i == index_j && *p_i1 > *p_j1) { j1 = *p_i1; i1 = *p_j1; }
if (index_k == index_l && *p_k1 > *p_l1) { l1 = *p_k1; k1 = *p_l1; }
if (index_i == index_k && index_j == index_l && i1 * shelj + j1 > k1 * shell + l1) {
tmp = i1; i1 = k1; k1 = tmp; tmp = j1; j1 = l1; l1 = tmp; }
p_F_temp = F_temp + i1 * shelj * shelk * shell + j1 * shelk * shell + k1 * shell + l1;
p_F_sh = F_sh + *p_i * shelj1 * shelk1 * shell1 + *p_j * shelk1 * shell1 + *p_k * shell1 + *p_l;
*p_F_sh += fac2 * *p_F_temp * *p_rot1 * *p_rot2 * *p_rot3 * *p_rot4;
p_l++;
p_l1++;
p_rot4++;
}
p_k++;
p_k1++;
p_rot3++;
}
p_j++;
p_j1++;
p_rot2++;
}
p_i++;
p_i1++;
p_rot1++;
}
p_F_sh = F_sh;
for (i = 0; i < sheli1; i++)
for (j = 0; j < shelj1; j++)
for (k = 0; k < shelk1; k++)
for (l = 0; l < shell1; l++) {
fac = one;
if (index_i == index_j && i == j) fac /= two;
if (index_k == index_l && k == l) fac /= two;
if ((index_i == index_k && index_j == index_l) && (i == k && j == l)) fac /= two;
if (fabs(*p_F_sh) > 1e-09)
d_fgtuvfinal[blockIdx.x + blockIdx.y * 6] = *p_F_sh * fac;
count++;
p_F_sh++;
}
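The chain of blockIdx.x and blockIdx.y tests near the top of the kernel hard-codes a mapping from a linear block index 0..5 to an unordered pair of shell types (i, j) with i <= j. The same mapping can be written as a triangular-pair enumeration; a minimal plain-C sketch (the function name is illustrative, not part of EXCITON):

```c
/* Map a linear index b in 0..5 to an unordered pair (i, j), i <= j,
 * over three shell types, in the order used by the blockIdx tests:
 * 0->(0,0) 1->(0,1) 2->(0,2) 3->(1,1) 4->(1,2) 5->(2,2). */
static void pair_from_block(int b, int *i, int *j)
{
    int n = 3, k = b;
    for (*i = 0; *i < n; (*i)++) {
        int row = n - *i;              /* pairs remaining in this row */
        if (k < row) { *j = *i + k; return; }
        k -= row;
    }
}
```

Enumerating only ordered pairs in this way (together with the `blockIdx.y > blockIdx.x` skip) is what exploits the permutational symmetry of the integrals across blocks.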
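The long sequence of f[...] updates in the kernel is an unrolled form of the McMurchie-Davidson upward recursion for the auxiliary integrals R^n_{tuv}, seeded from the Boys-function values computed by f000m_device. A generic sketch of the recursion for a single Cartesian direction in plain C (array sizes and names are illustrative, not the kernel's data layout):

```c
#define NMAX 8

/* One-direction McMurchie-Davidson upward recursion:
 *   R^n_{t+1} = x * R^{n+1}_t + t * R^{n+1}_{t-1},
 * seeded with R^n_0 = F_n (Boys-function values).
 * R[t][n] holds R^n_t; boys[n] holds F_n for n = 0..mm. */
static void md_recursion_1d(double x, const double *boys, int mm,
                            double R[NMAX][NMAX])
{
    for (int n = 0; n <= mm; n++)
        R[0][n] = boys[n];
    for (int t = 0; t + 1 <= mm; t++)
        for (int n = 0; n <= mm - t - 1; n++)
            R[t + 1][n] = x * R[t][n + 1]
                        + (t > 0 ? t * R[t - 1][n + 1] : 0.0);
}
```

Each unrolled kernel line is one instance of this update for a particular (t, u, v); unrolling trades code size for the elimination of loop overhead and index arithmetic on the device.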
References
[1] Boys, S. F. Electronic Wave Functions. I. A General Method of Calculation for the Stationary States of Any Molecular System. Proceedings of the Royal Society of London, Ser. A 1950, 200, 542.
[2] McMurchie, L. E.; Davidson, E. R. One- and Two-Electron Integrals over Cartesian Gaussian Functions. Journal of Computational Physics 1978, 26, 218.
[3] Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. Journal of Chemical Theory and Computation 2008, 4, 222-231.
[4] Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation. Journal of Chemical Theory and Computation 2009, 5, 1004-1015.
[5] Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics. June 11, 2009.
[6] Kirk, David; Hwu, Wen-mei. Programming Massively Parallel Processors (draft CUDA textbook), 2006-2008.
[7] Moore, Gordon E. Cramming More Components onto Integrated Circuits. Electronics Magazine, 1965, p. 4.
[8] Garg, Rahul. A Compiler for Parallel Execution of Numerical Python Programs on Graphics Processing Units. Master's Thesis, Department of Computer Science, University of Alberta, Fall 2009.
[9] NVIDIA CUDA C Programming Best Practices Guide, CUDA Toolkit 2.3, July 2009.
[10] NVIDIA CUDA Programming Guide, Version 2.3.1, August 26, 2009.
[11] Almlof, J.; Faegri, K.; Korsell, K. Principles for a Direct SCF Approach to LCAO-MO Ab Initio Calculations. Journal of Computational Chemistry 1982, 3, 385.
[12] Hohenberg, P.; Kohn, W. Inhomogeneous Electron Gas. Physical Review 1964, 136, B864.
[13] Szabo, A.; Ostlund, N. S. Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory. McGraw-Hill, 1989.
[14] Hedin, L. New Method for Calculating the One-Particle Green's Function with Application to the Electron-Gas Problem. Physical Review 1965, 139, A796.
[15] Parr, Robert G.; Yang, Weitao. Density-Functional Theory of Atoms and Molecules. Oxford University Press, 1989.
[16] Volkov, V.; Demmel, J. W. Benchmarking GPUs to Tune Dense Linear Algebra. Conference on High Performance Networking and Computing, November 2008.
[17] Shi, Guochun. Implementation of Scientific Computing Applications on the Cell Broadband Engine Processor. NCSA, University of Illinois at Urbana-Champaign.
[18] Hill, M. D.; Marty, M. R. Amdahl's Law in the Multicore Era. University of Wisconsin / Google, July 2008.
[19] Amarasinghe, Saman. Introduction to Multicore Programming, Lecture 1. MIT, January 2007.
[20] Farber, Rob. CUDA, Supercomputing for the Masses. April 15, 2008.
[21] Pople, J. A.; Segal, G. A. Approximate Self-Consistent Molecular Orbital Theory. III. CNDO Results for AB2 and AB3 Systems. Journal of Chemical Physics 1966, 44, 3289.
[22] Figure from http://en.wikipedia.org/wiki/Amdahl's_law, under a Creative Commons licence.