
Calculating Electron Repulsion Integrals on GPU and CPU Architectures

Henry Lambert

January 7, 2010


Contents

1 Abstract
2 Introduction
3 Physical Considerations
    3.0.1 Electron-Electron Repulsion Algorithm
    3.0.2 Gaussian Functions
    3.0.3 McMurchie-Davidson
4 Computational Considerations
    4.1 GPU Hardware
5 Method: ERI Implementation on GPU
6 Results and Discussion
7 Conclusion
8 Acknowledgement
A NVIDIA Hardware Specifications
B GPU Kernel Code


1 Abstract

Large demand for high quality Graphical Processing Units (GPUs) for applications in video games and home entertainment systems has led to the introduction of machines with enormous computing potential at a relatively low price. The NVIDIA Corporation has provided a C-like extension language called Compute Unified Device Architecture[9] to allow scientific programmers to exploit the processing power of the GPU for intensive numerical computations. Computational chemistry is a field full of numerically intensive computing tasks; in particular, the calculation of the coulombic electron repulsion integrals (ERIs) in molecular systems scales as O(N^4) with the number of electrons in a system. This rapid increase in computational demands with system size represents a significant bottleneck in all ab initio molecular modeling programs. Building on the McMurchie-Davidson algorithm[2] (which describes a method for the computation of ERIs), this project attempted to implement the ERI section of the existing EXCITON software package on a GPU using a method devised by Ufimtsev and Martinez[3], who have used GPUs to speed up the calculation of ERIs for non-trivial chemical systems by factors of up to 140. This project's GPU implementation of ERIs in EXCITON was found to run at speeds comparable to the CPU implementation for the smallest atomic systems, such as He and Be, and to run faster by a factor of two for the larger neon atom. As the code is extended to larger, many-atom systems, this performance advantage is expected to grow.


2 Introduction

The demand for Graphical Processing Units (GPUs) in home entertainment systems has led to the introduction of machines with enormous computing potential at a low price. The computing potential and availability of GPUs have made them good candidates for increasing the speed and efficiency of high performance computers in the computational sciences. GPUs could allow users to increase computational performance by orders of magnitude over traditional multi-core architectures while still using a high level language with distinct similarities to C [20]. Figure 1 demonstrates the scale of the performance increase of GPUs compared to traditional Central Processing Units (CPUs) in the last few years.

Figure 1: Relative performance in GFlops (giga-Floating Point Operations Per Second) of different GPU models vs CPUs in linear algebra applications at peak performance over the last six years.

This wide divergence in performance shows that, for parallel computing applications, CPU speeds are being soundly trumped by those of GPUs. Not only are GPUs powerful computing machines, they are also relatively inexpensive. These graphics cards have a large number of processing cores (the elements of the chip which execute instructions) per device and a distributed shared memory model, both of which suit intensive, parallel, numerical computations, as will be discussed in more detail subsequently. The NVIDIA corporation is one of the leaders in graphics card production and is interested in fostering a new market for professionals looking to exploit its technology in diverse fields: computational finance, fluid dynamics, and finite element analysis. NVIDIA has provided a C-extension language called CUDA[10] (Compute Unified Device Architecture) in order to facilitate porting code from CPUs onto GPUs to exploit these highly parallel systems in search of massive speed-ups. This project sought to port to the GPU the section of an existing code package called EXCITON¹ concerned with the calculation of Electron Repulsion Integrals (ERIs). EXCITON is a code for calculating the influence of many-body effects in molecular systems. Incorporating some GW code [14] and considerations introduced by the Bethe-Salpeter equations, the EXCITON software package is meant to improve upon traditional mean field theory by considering the motion of bound electron-hole states and their interaction with plasmons and phonons in a solid. Any speed-up in the calculation of two-electron repulsion integrals obtained during this port to the GPU will be benchmarked against the original CPU implementation, the difficulties and advantages of programming on the GPU will be discussed, and possibilities for further implementation of traditional solid state chemistry techniques on this novel architecture will be considered.

Quantum theory is capable of a full description of the dynamics of any molecular system. However, solving the equations which govern the dynamical development of these molecular systems quickly becomes intractable. Density Functional Theory (DFT) [12] and Hartree-Fock (HF) theory are the two most popular frameworks for solving the many-electron Schrodinger equation governing the dynamics of a molecular quantum system. An important step in both DFT and HF is the calculation of the classical coulomb repulsion between pairs of electrons in a molecular system. The number of these integrals scales as O(N^4), where N is the number of electrons in the interacting molecular system. This O(N^4) scaling means that even for relatively small systems the number of integrals to be evaluated is very large. For instance, without using any symmetry arguments, a single benzene molecule, which has 42 electrons, would require the calculation of 388,962 electron repulsion integrals. That holds for an isolated benzene molecule in a vacuum. For physically interesting systems with real technological applications, one may want to consider a sheet of graphene with hundreds of six-carbon rings. A system of this size would push the number of two-electron repulsion integrals to be evaluated well into the billions. Addressing this computational workload is of real importance for describing the electronic properties of a system and requires novel approaches, including restructuring the traditional algorithm and experimenting with new computing architectures.

¹The main developers of the EXCITON software package are Charles Patterson, Svetjlana Galamic-Mulaomerovic, and Conor Hogan. Contact: [email protected]

To model the electronic structure of a molecular system requires an initial ansatz to describe the orbitals of the system's electrons. From this ansatz it is possible to iterate towards a description of the molecular system's ground state. S. F. Boys[1] was the first to suggest using a linear combination of Gaussian type orbitals as the ansatz for the electronic charge distribution. From the Gaussian orbitals it is possible to obtain explicit numerical descriptions of the electronic orbitals in any molecular system; in Boys' words, "to any required degree of accuracy with the only bound being, the labour of computation." The advantages of these Gaussian orbitals are that their integrals can be evaluated in a straightforward manner and that they are convenient for modeling charge distributions for electronic orbitals of any angular momentum on multiple nuclear centers. The use of Gaussian type orbitals and the McMurchie-Davidson scheme has gained widespread acceptance and has facilitated the numerical modeling of chemical systems. These schemes have benefited enormously from the exponential growth in computer processing power implied by Moore's Law.

In 1965 Gordon Moore wrote an article[7] for Electronics Magazine in which he projected that the number of transistors on a silicon chip at a given price point would double every eighteen months. The number of transistors on a chip is intimately correlated with the processing speed a computer is capable of achieving.

Moore's law has held to a remarkable degree of accuracy since its original statement. Today, however, the semiconductor industry is approaching the physical limits on the number of transistors which can be placed on a single silicon chip. Excessive heat generation due to transistor density, and quantum mechanical effects such as electrons tunneling through the few nanometers which separate transistors, present fundamental engineering difficulties. New technologies and compute architectures are needed if the next generation of processors is to extend Moore's law and keep processing power growing exponentially.

A popular model for extending Moore's law introduces graphical processing units (GPUs) into a GPU-CPU hybrid computing system. This hybrid approach distributes work to each device according to its nature: serial code runs on the CPU, and numerically intensive processes which can be parallelized run on the GPU. This is because the compute architecture of GPUs is much better suited to highly parallelized applications than that of a traditional CPU. The NVIDIA programming guide describes succinctly how the features of GPUs that make them suitable for image rendering translate to highly parallel scientific computing applications:

    In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology. [10]

The GPU contains many more processing cores, a shared memory model with high on-device bandwidth, and a highly multi-threaded structure for image processing; all these features favour parallelized data processing.

Professor Saman Amarasinghe of MIT has suggested that the combination of CPU and GPU processors into a single hybrid computing model represents a paradigm shift that mirrors previous transitions in computing technology. He cites the shift from assembly language to high-level languages like C and Fortran in the 1970s, and the transition to object oriented programming in the 1990s[19]. The first shift was meant to give the programmer the freedom to consider abstract algorithms without needing to focus on the details of the computer's physical architecture. The shift to object oriented programming was meant to facilitate the integration of huge bodies of code from large-scale software development projects with large numbers of widely distributed programmers. The transition to a hybrid computing model will require the re-thinking of traditional algorithms to utilize the processing power of the GPU and surpass the aforementioned engineering difficulties that hinder the continuation of Moore's law.


3 Physical Considerations

3.0.1 Electron-Electron Repulsion Algorithm

In order to predict and explain the electronic, magnetic, and structural properties of molecular systems it is necessary to solve the quantum mechanical wave equations governing their dynamics. In crystal systems there are multiple nuclear centers surrounded by varying numbers of electrons: these are many-body systems. Numerous approaches have been developed for studying many-body interactions, but the problem remains exceedingly difficult even for idealized situations. Many-body systems possess far too many degrees of freedom to be reduced to a simple analytical system of equations, so it becomes necessary to develop numerical approaches to solving the governing equations and modeling their behaviour. The most widely used approaches to studying many-body molecular systems are Hartree-Fock methods and Density Functional Theory. In both cases the ground state energy of a system can only be determined by solving a large system of non-linear integro-differential equations. A discussion of the Hartree-Fock method, and certain considerations from DFT, describes the physical context in which two-electron repulsion integrals arise and what information can be gained by calculating them. This section is the most mathematically intensive of the paper and introduces key notation. It begins with a brief definition of the Slater determinant, and then states the Hamiltonian used to describe the dynamics of these molecular systems. The second half of the section focuses on the representation of electronic wave functions by Gaussian orbitals and the algorithmic calculation of the two-electron coulomb repulsion integrals. Understanding the physical context in which ERIs arise, and the algebraic and algorithmic steps used to calculate them, is important for understanding how the computation is implemented on the GPU.

In Hartree-Fock theory, the ground state many-electron wave function is constructed from a Slater determinant (equation 1) of single electron wave functions. Higher energy states can then be constructed from linear combinations of these "detors"². The properties of such a determinant ensure that all the conditions for a system of interacting fermions are met; for instance, the wave function is anti-symmetric with respect to particle exchange.

²S. F. Boys suggested calling these determinants of normalized, orthogonalized single particle wave functions "detors", but the term never made it into general use.


\Psi_0^{HF} = \frac{1}{\sqrt{N!}} \begin{vmatrix} \psi_1(\mathbf{r}_1) & \psi_2(\mathbf{r}_1) & \cdots \\ \psi_1(\mathbf{r}_2) & \psi_2(\mathbf{r}_2) & \cdots \\ \vdots & \vdots & \ddots \end{vmatrix}   (1)

In the Hartree-Fock scheme this ground state wave function Ψ_0^HF is minimized by finding the appropriate ψ_i. Starting from a trial solution, the orbitals are systematically modified until the system is in its lowest energy conformation. The inherent assumptions are that the system is non-relativistic and that the nuclei are static.

The Schrodinger equation for the Hartree-Fock approach can be written [15]:

H^{HF} |\Psi_0\rangle = E^{HF} |\Psi_0\rangle   (2)

where the Hamiltonian H^{HF} is composed of the one electron terms:

h_i = -\frac{\hbar^2}{2m}\nabla_i^2 - \frac{Ze^2}{r_i}   (3)

the two electron coulomb repulsion integrals:

J_i(\mathbf{r}) = \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, d\mathbf{r}'   (4)

which take the expectation value:

\langle J_{ij} \rangle = \iint d\mathbf{r}\, d\mathbf{r}'\, \psi_i(\mathbf{r})\psi_i^*(\mathbf{r}) \frac{e^2}{|\mathbf{r}-\mathbf{r}'|} \psi_j^*(\mathbf{r}')\psi_j(\mathbf{r}')   (5)

and the two electron exchange integrals (a purely quantum mechanical interaction arising in the expansion of the determinant):

K_i(\mathbf{r},\mathbf{r}') = \int \frac{\psi_j^*(\mathbf{r}')\psi_i(\mathbf{r})}{|\mathbf{r}-\mathbf{r}'|}\, d\mathbf{r}'   (6)

which have the expectation value:

\langle K_{ij} \rangle = \iint d\mathbf{r}\, d\mathbf{r}'\, \psi_i^*(\mathbf{r})\psi_j(\mathbf{r}) \frac{e^2}{|\mathbf{r}-\mathbf{r}'|} \psi_i(\mathbf{r}')\psi_j^*(\mathbf{r}')   (7)

It is the coulombic J_{ij} and K_{ij} terms which quickly proliferate, as will be seen in section 3.0.3. The K_{ij} terms require a different implementation on the GPU and are treated by Ufimtsev and Martinez[4]; this paper is concerned only with the calculation of the classical coulombic two-electron repulsion integrals.

Even though it introduces no new physics, it is worth restating the one electron (eq. 8) and two electron (eq. 9) terms of Hartree-Fock theory in the notation of second quantization. It is this notation that will be used to refer to the one and two electron terms in the following discussion and conclusion.[13]

Single electron terms:

O_1 = \sum_{ij} h_{ij}\, a_i^\dagger a_j   (8)

Two electron terms:

O_2 = \frac{1}{2} \sum_{ijkl} g_{ijkl}\, a_i^\dagger a_k^\dagger a_j a_l   (9)

3.0.2 Gaussian Functions

In an important paper, S. F. Boys [1] showed that the numerical electronic wave functions of any atomic or molecular system can be obtained by modeling the system entirely with Gaussian functions and their derivatives. Gaussians are not only sufficient for constructing a system of functions that accurately describes any molecular system, they also allow the explicit evaluation of the necessary integrals that arise. Boys concluded that Gaussian functions satisfactorily model the charge distribution of electrons in crystal systems, to any degree of accuracy, and could be used "with the molecular orbital method, or localized bond method, or the generalized method of treating linear combinations of many Slater Determinants by the variational procedure." Gaussian functions also possess certain mathematical properties which make them easy to manipulate.

A model Gaussian for an s orbital takes the form[13]:

\phi_{1s}^{GF}(\alpha, \mathbf{r}-\mathbf{R}_A) = \left(\frac{2\alpha}{\pi}\right)^{3/4} e^{-\alpha|\mathbf{r}-\mathbf{R}_A|^2}   (10)

where R_A is the nuclear co-ordinate and r is the electron's co-ordinate. Modeling the charge distribution using a Gaussian differs in form from the analytical treatment of a 1s electron. The analytic solution for a single electron wave function takes the form of a Slater-type orbital:

\phi_{1s}^{SF}(\gamma, \mathbf{r}-\mathbf{R}_A) = \left(\frac{\gamma^3}{\pi}\right)^{1/2} e^{-\gamma|\mathbf{r}-\mathbf{R}_A|}   (11)

The functional behaviour of a Gaussian differs from that of a Slater type orbital, particularly in the value of the derivative at r = 0 and in the faster fall-off of the Gaussian at large r. This difference is mitigated partly by the ease of computation of Gaussian functions and partly by taking linear combinations of multiple Gaussians to better approximate the Slater function.

The ease of computation alluded to earlier comes from two facts: the product of two Gaussian orbitals is again a Gaussian, and easily defined recursion relations exist for the higher order derivatives of Gaussian functions. The first property allows one to reduce the product of two electronic charge distributions on two different nuclear centers to a single charge distribution around a single modified center.

\phi_{1s}^{GF}(\alpha, \mathbf{r}-\mathbf{R}_A)\, \phi_{1s}^{GF}(\beta, \mathbf{r}-\mathbf{R}_B) = K_{AB}\, \phi_{1s}^{GF}(p, \mathbf{r}-\mathbf{R}_P)   (12)

K_{AB} = \left(\frac{2\alpha\beta}{(\alpha+\beta)\pi}\right)^{3/4} \exp\left(\frac{-\alpha\beta}{\alpha+\beta}\, |\mathbf{R}_A-\mathbf{R}_B|^2\right)   (13)

where p = α + β is the sum of the Gaussian exponents and R_P is the reduced co-ordinate.
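As a concrete illustration of equations 12 and 13, the hypothetical helper below computes the combined exponent p, the reduced centre R_P, and the pre-factor K_AB for two primitive s-type Gaussians. This is a sketch for illustration only, not EXCITON code.

    /* Sketch of the Gaussian product rule (eqs. 12-13) for two primitive
       s-type Gaussians; illustrative helper, not EXCITON code. */
    #include <math.h>

    typedef struct { double x, y, z; } vec3;

    /* Returns K_AB and writes the combined exponent p and the reduced
       centre RP of the product Gaussian. */
    double gaussian_product(double alpha, vec3 RA, double beta, vec3 RB,
                            double *p, vec3 *RP)
    {
        *p = alpha + beta;
        RP->x = (alpha * RA.x + beta * RB.x) / *p;   /* reduced centre */
        RP->y = (alpha * RA.y + beta * RB.y) / *p;
        RP->z = (alpha * RA.z + beta * RB.z) / *p;
        double dx = RA.x - RB.x, dy = RA.y - RB.y, dz = RA.z - RB.z;
        double r2 = dx*dx + dy*dy + dz*dz;           /* |RA - RB|^2 */
        return pow(2.0*alpha*beta / (*p * M_PI), 0.75)
             * exp(-alpha*beta / (*p) * r2);         /* eq. 13 */
    }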

As stated previously, the functional behaviour of a Gaussian differs from that of the analytic solution. To better imitate the analytically determined Slater function, a linear combination of "primitive" Gaussian functions (identical in form to a normal Gaussian) is used to form a contracted Gaussian function. The number of Gaussian functions in this contraction is called the contraction length L. These contracted Gaussian basis sets have been numerically optimized and a number of readily obtained, standardized basis sets exist[21].

\phi_\mu^{CGF}(\mathbf{r}-\mathbf{R}_A) = \sum_{p=1}^{L} d_{p\mu}\, \phi_p^{GF}(\alpha_{p\mu}, \mathbf{r}-\mathbf{R}_A)   (14)

The integrals for a contracted basis function are obtained simply by summing integrals over the individual primitive functions with the appropriate contraction co-efficients, as sketched below. It is important to note that the co-efficients of the individual primitives are not re-optimized: only the contraction as a whole is. This sum over primitive integrals increases the computational workload and is addressed in section 5.
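Schematically, the summation looks as follows. primitive_eri is an assumed helper evaluating a single primitive two-electron integral, and for brevity all four centres share one contraction of length L; the real code naturally carries four independent shells.

    /* Sketch: a contracted ERI as the weighted sum over primitive ERIs
       (eq. 14 applied to all four indices).  Illustrative only. */
    double primitive_eri(double a, double b, double c, double d); /* assumed */

    double contracted_eri(int L, const double d_coef[], const double alpha[])
    {
        double eri = 0.0;
        for (int i = 0; i < L; ++i)
          for (int j = 0; j < L; ++j)
            for (int k = 0; k < L; ++k)
              for (int l = 0; l < L; ++l)      /* L^4 primitive integrals */
                eri += d_coef[i] * d_coef[j] * d_coef[k] * d_coef[l]
                     * primitive_eri(alpha[i], alpha[j], alpha[k], alpha[l]);
        return eri;
    }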

3.0.3 McMurchie-Davidson

The McMurchie-Davidson scheme[2] for calculating two electron integrals over cartesian Gaussian functions is reproduced here. The original paper gives a method for computing these integrals using a few simple auxiliary functions which are in turn defined by recursion relations. Understanding aspects of this algorithm will clarify certain choices made in transferring it to the GPU, and will help show where performance limits arise between algorithmic demands and the physical capabilities of the GPU. For instance, for higher angular momentum functions the McMurchie-Davidson scheme requires the temporary storage of so many double precision numbers that the arrays exceed the capacity of shared memory (discussed in section 5), causing a performance degradation. The important thing to take away from this section is that all two-electron repulsion integrals, between orbitals of any angular momentum, are proportional to derivatives of the incomplete gamma function with respect to the x, y, z co-ordinates.

Start from a generalized Gaussian basis function describing an electron's charge distribution on nuclear center A:

\phi(n, l, m, \alpha_A, A) = x_A^n\, y_A^l\, z_A^m\, e^{-\alpha_A r_A^2}   (15)

and then incorporate another Gaussian basis function describing an electron's charge distribution on nuclear center B. The charge distribution of the two basis functions on the two different nuclear centers A and B can be written:

\Omega_{ij} = \phi_i \phi_j = x_A^n x_B^{n'}\, y_A^l y_B^{l'}\, z_A^m z_B^{m'}\, \exp[-(\alpha_A r_A^2 + \alpha_B r_B^2)]   (16)

Using the relations from 3.0.2 this finally takes the form:

\exp[-(\alpha_A r_A^2 + \alpha_B r_B^2)] = E_{ij}\, \exp[-\alpha_p r_p^2]   (17)

where

E_{ij} = \exp\left[-\frac{\alpha_A \alpha_B}{\alpha_A + \alpha_B}\, |A-B|^2\right]

The essence of the algorithm is that higher angular momentum wave functions are described simply as partial derivatives of an s-orbital wave function, and the s-orbital wave function is proportional to the incomplete gamma function. Hence, following Boys, McMurchie-Davidson writes the general two electron repulsion integral between electronic orbitals of any angular momentum as a series of partial derivatives of a zero angular momentum Gaussian function:

[NLM|\frac{1}{r_{12}}|N'L'M'] = \left(\frac{\partial}{\partial P_x}\right)^{N} \left(\frac{\partial}{\partial P_y}\right)^{L} \left(\frac{\partial}{\partial P_z}\right)^{M} \left(\frac{\partial}{\partial Q_x}\right)^{N'} \left(\frac{\partial}{\partial Q_y}\right)^{L'} \left(\frac{\partial}{\partial Q_z}\right)^{M'} [000|\frac{1}{r_{12}}|000]   (18)

where P and Q are the combined nuclear centers:

P = \frac{\alpha A + \beta B}{\alpha + \beta}   (19)

Q = \frac{\gamma C + \delta D}{\gamma + \delta}   (20)

To calculate equation 18 efficiently, McMurchie-Davidson defines a further auxiliary function:

R_{NLM} = \left(\frac{\partial}{\partial a}\right)^{N} \left(\frac{\partial}{\partial b}\right)^{L} \left(\frac{\partial}{\partial c}\right)^{M} \int_0^1 e^{-Tu^2}\, du   (21)

where T (cf. equation 29) is defined:

T = \alpha(a^2 + b^2 + c^2)   (22)

Using this auxiliary function and a series of recursion relations defined by McMurchie-Davidson, equation 18 can be computed in a straightforward manner. Since the GPU does not support recursive function calls, the recursion relations used to calculate equation 21 had to be expanded into a series of nested for-loops for execution on the GPU, along the lines of the sketch below.
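A minimal sketch of such an unrolled scheme follows. It builds the auxiliary values iteratively from the standard base case R^(j)_000 = (-2α)^j F_j(T); raising N, L, or M by one consumes one order of j. LMAX and boys_f are illustrative assumptions, not the project's actual kernel code.

    /* Sketch (not the project's kernel): the McMurchie-Davidson recursion
       for R_NLM (eq. 21) unrolled into nested for-loops, since recursive
       device functions are not supported. */
    #define LMAX 4                      /* bound on N+L+M in this sketch */

    __device__ double boys_f(int j, double T);   /* F_j(T), cf. eq. 27 */

    __device__ void md_R(double alpha, double a, double b, double c, double T,
                         double R[LMAX+1][LMAX+1][LMAX+1][LMAX+1])
    {
        double f = 1.0;                 /* accumulates (-2*alpha)^j */
        for (int j = 0; j <= LMAX; ++j) {
            R[j][0][0][0] = f * boys_f(j, T);    /* base case */
            f *= -2.0 * alpha;
        }
        for (int n = 0; n < LMAX; ++n)  /* raise N along direction a */
          for (int j = 0; j < LMAX - n; ++j)
            R[j][n+1][0][0] = a * R[j+1][n][0][0]
                            + (n ? n * R[j+1][n-1][0][0] : 0.0);
        for (int n = 0; n <= LMAX; ++n) /* raise L along direction b */
          for (int l = 0; n + l < LMAX; ++l)
            for (int j = 0; j < LMAX - n - l; ++j)
              R[j][n][l+1][0] = b * R[j+1][n][l][0]
                              + (l ? l * R[j+1][n][l-1][0] : 0.0);
        for (int n = 0; n <= LMAX; ++n) /* raise M along direction c */
          for (int l = 0; n + l <= LMAX; ++l)
            for (int m = 0; n + l + m < LMAX; ++m)
              for (int j = 0; j < LMAX - n - l - m; ++j)
                R[j][n][l][m+1] = c * R[j+1][n][l][m]
                                + (m ? m * R[j+1][n][l][m-1] : 0.0);
        /* The integrals of eq. 18 use the j = 0 slab, R[0][N][L][M]. */
    }

Note that the staging table holds (LMAX+1)^4 doubles: about 5 KB at LMAX = 4, but roughly 19 KB at LMAX = 6, past the 16 KB shared memory budget discussed in section 4.1. This is the storage pressure on higher angular momentum functions mentioned above.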

The pre-factors, which are computed recursively on the CPU, are stated in equations 24-26 and are calculated using recursion relations similar to those of the Hermite polynomials. Only the results for the x axis are reproduced here, but the arguments extend directly to the y and z co-ordinates:

x_A \Lambda_N(x_p; \alpha_p) = N\Lambda_{N-1} + \overline{(PA)}_x \Lambda_N + \frac{1}{2\alpha_p}\Lambda_{N+1}   (23)

\Lambda_j(x_p; \alpha_p)\, e^{-\alpha_p x_p^2} = \left(\frac{\partial}{\partial P_x}\right)^{j} e^{-\alpha_p x_p^2}   (24)

\Lambda_j(x_p; \alpha_p) = \alpha_p^{j/2}\, H_j(\alpha_p^{1/2} x_p)   (25)

x_A^n x_B^{n'} = \sum_{N=0}^{n+n'} d_N^{nn'}\, \Lambda_N(x_p; \alpha_p)   (26)

H_j(\alpha_p^{1/2} x_p) are Hermite polynomials defined in the usual way.

The core integral in equation 18 was calculated by Boys to be:

[000|r_{12}^{-1}|000] = \lambda F_0(T)   (27)

where F_0 is the incomplete gamma function, and λ and T are terms pre-calculated on the CPU as discussed in section 5:

\lambda = \frac{2\pi^{5/2}}{\alpha_P \alpha_Q (\alpha_P + \alpha_Q)^{1/2}}   (28)

T = \frac{\alpha_P \alpha_Q}{\alpha_P + \alpha_Q}\, \overline{PQ}^2   (29)

Besides these efficient recursive algorithms for the different components of the ERIs, the McMurchie-Davidson scheme also details a method for evaluating the incomplete gamma function to the required numerical accuracy through the use of an interpolation table. This procedure resides as a device-side function; a sketch of the idea follows.
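The sketch below shows one common shape such a device-side routine can take: tabulated values F_k(T_i) on a uniform grid combined with a short Taylor expansion, plus the asymptotic form for large T. The grid spacing, table layout, and Taylor order here are assumptions for illustration, not EXCITON's actual values.

    /* Sketch of a table-driven Boys/incomplete-gamma evaluation (eq. 27).
       Grid, table name, and Taylor order are illustrative assumptions. */
    #include <math.h>
    #define F0_NPTS  301      /* grid points covering T = 0 ... 30 */
    #define F0_STEP  0.1      /* tabulation interval in T */
    #define F0_TERMS 4        /* Taylor terms retained */

    /* f0_table[i][k] = F_k(T_i), precomputed on the CPU.  Since
       dF_j/dT = -F_{j+1}, F_0(T) = sum_k F_k(T_i) (T_i - T)^k / k!. */
    __constant__ double f0_table[F0_NPTS][F0_TERMS];

    __device__ double boys_f0(double T)
    {
        if (T >= (F0_NPTS - 1) * F0_STEP)        /* asymptotic regime */
            return 0.5 * sqrt(M_PI / T);
        int    i    = (int)(T / F0_STEP + 0.5);  /* nearest grid point */
        double dT   = i * F0_STEP - T;
        double term = 1.0, sum = 0.0;
        for (int k = 0; k < F0_TERMS; ++k) {     /* short Taylor series */
            sum  += f0_table[i][k] * term;       /* F_k(T_i) dT^k / k! */
            term *= dT / (k + 1);
        }
        return sum;
    }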

4 Computational Considerations

4.1 GPU Hardware

The speed advantage of a GPU over a CPU stems from the physical layout of the GPU device: specifically, the way memory is distributed on the chips and the multiple processing cores, which operate in parallel. The processing cores are the part of the computer which execute the instructions defined by the programmer, and the GPU's computing advantage comes from its threading resources. The NVIDIA programming guide explains the relative advantage of the GPU model: "Execution pipelines (CPU cores) on host systems can support a limited number of concurrent threads. Servers that have four quad-core processors today can run only 16 threads in parallel ... By comparison, the smallest executable unit of parallelism on a GPU device, called a warp, comprises 32 threads. All NVIDIA GPUs can support 768 active threads per multiprocessor, and some GPUs support 1,024 active threads per multiprocessor. On devices that have 30 multiprocessors this leads to more than 30,000 active threads." [9] The theoretical limit on the number of threads which can be scheduled in a single kernel launch on a Tesla device is 2,198,956,147,200 (a 65,535 x 65,535 grid of blocks at 512 threads per block). The actual implementation of the ERI algorithm on the GPU is discussed in section 5; this section describes the basic considerations of general parallel computing and the different layouts of the GPU and traditional CPU architectures.

To determine whether or not a computation will benefit from being ported to a parallel architecture, Amdahl's law gives an approximation to the expected benefits of parallelization. A modern statement of Amdahl's law is given by equation 30, which states that if a fraction f of a computation is sped up by a factor of S, the speed-up of the entire code will be[18]:

\text{Speedup}_{\text{overall}}(f, S) = \frac{1}{(1-f) + \frac{f}{S}}   (30)

Graphically this law can be interpreted as in figure 2. Clearly the most important factor is the proportion of the existing algorithm which can be parallelized.

Generating all the combinations of electron repulsion integrals that need to be calculated is a relatively small proportion of the calculation; for certain systems this can amount to one part in a thousand of the total execution time. This small serial component means the fraction of the computation that can be parallelized and sped up is very large, making the denominator in equation 30 very small and hence the total potential benefit of parallelizing the algorithm across the many-core architecture of the GPU very large, as the toy calculation below illustrates. This large potential benefit is the motivation for this project.
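To make equation 30 concrete, the toy host-side calculation below evaluates the bound for the roughly 1-in-1000 serial share just described; the numbers are illustrative, not project measurements.

    /* Toy evaluation of Amdahl's law (eq. 30); illustrative numbers only. */
    #include <stdio.h>

    static double amdahl(double f, double S)   /* f: parallel fraction */
    {
        return 1.0 / ((1.0 - f) + f / S);
    }

    int main(void)
    {
        /* ERI-like workload: 99.9% parallel, 240 cores -> ~194x bound */
        printf("f=0.999, S=240: %.1fx\n", amdahl(0.999, 240.0));
        /* A half-serial code saturates near 2x however many cores exist */
        printf("f=0.5,   S=240: %.1fx\n", amdahl(0.5, 240.0));
        return 0;
    }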

Figure 2: Relative performance enhancement of a computation as a function of the fraction of the algorithm which can be parallelized and divided between multiple processors.[22]

The general paradigm for parallel programming breaks into two models: the master-slave model and the peer model[9]. In the master-slave model, the master node runs a serial program and distributes the computationally intensive parts to its slave nodes. In the peer model, a collection of nodes runs the same instructions simultaneously. The GPU-CPU model in fact possesses both levels of parallelism. The first level is master-slave parallelism: the CPU runs a serial program and then calls its 'slave node', the GPU, to run the computationally intensive electron repulsion integral algorithm in SIMD fashion. The GPU itself is the second level of parallelism: a peer model parallel processor divided into streaming multi-processors which execute blocks, each composed of an ordered array of threads running instructions in parallel.

Wen-mei Hwu and David Kirk of NVIDIA use the analogy of a peach to demonstrate the appropriate application of GPUs within this paradigm (Figure 3). The hard peach pit is the serial component of an algorithm, which the CPU can process rapidly using its high clock rates, memory architecture, and compiler techniques. The soft peach flesh represents the data-parallel aspect of the algorithm, which can be divided across the 240 processing cores of the Tesla GPU for high throughput computing.

Figure 3: David Kirk and Wen-mei Hwu's peach analogy. The hard pit is the serial component of the algorithm and the flesh is ripe for data-parallel exploitation. Trying to bite into the pit is equivalent to porting intrinsically serial code onto the GPU: it will not be successful. The flesh is suitable for GPU implementation. [6]

Figure 4 is a comparative schematic of the GPU and CPU architectures. Each of the figures in this section has a fairly detailed caption; studying the schematics is the easiest way to get a sense of the computational resources the GPU provides and how these resources influence programming decisions.

The type of data parallelism used by current NVIDIA GPUs is essentially Single Instruction Multiple Data (SIMD). NVIDIA prefers to call it SIMT (Single Instruction Multiple Thread) parallelism to emphasize the highly threaded design of its GPU devices, but the essentials of the two approaches are the same. SIMD means that the same set of instructions is executed by many processing threads in parallel over the entirety of a large data set. The set of instructions is defined in the kernel using NVIDIA's CUDA programming framework; the NVIDIA programming guide states, "C for CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions"[10]. The kernel function determines the tasks that each individual thread will perform, and the kernel-executing threads are organized into blocks: self-contained computational units which the GPU's processing cores execute, as in the minimal example below. For this project each block will calculate a single electron repulsion integral. The block is a grid of threads which is loaded into one of the GPU's cores and executed from start to finish before the next block is loaded. The number of threads per block is therefore limited by the GPU resources available to each processing core, such as register space (which stores temporary variables) and shared memory space.
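A minimal, generic example of this kernel/block/thread structure (not the project's ERI kernel) is shown below; each of n threads applies the same instruction to one element of an array.

    /* Minimal CUDA example of the SIMD/SIMT kernel model: n threads,
       grouped into blocks, each execute the same instructions on one
       array element.  Generic illustration, not the ERI kernel. */
    __global__ void scale(double *x, double a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; /* global thread id */
        if (i < n)
            x[i] *= a;           /* same instruction, different data */
    }

    /* Host-side launch: a grid of blocks, 256 threads per block. */
    void scale_on_gpu(double *d_x, double a, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_x, a, n);
    }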

Inspection of Figures 4 and 5 shows that on the CPU the cache memory (fast access memory) is a single large contiguous grouping, as are the Arithmetic Logic Units (the transistors responsible for the mathematical manipulations) and the instruction unit (labeled control). This device topology is optimized for large serial computations: a small number of cores distributing data in a linear fashion to nearby ALUs and accessing a large pool of nearby cache memory. The GPU has a very different layout. The key consideration in its design is parallel algorithm execution: a graphics card for high quality video rendering needs to update the individual pixels on a screen many times per second, so there are multiple control centers, each with its own high speed cached memory and a large number of ALUs, to meet this engineering requirement. The technical specifications of the GPU used in this project are included in appendix A.

Figure 4: Comparison of CPU chip layout and GPU chip layout. The green ALUs (arithmetic logic units) are responsible for all the math and logic manipulations of the data. As the schematic demonstrates, the GPU has a large number of ALUs allocated to each control unit or streaming multiprocessor (cf. figure 7), together with its own easily accessed shared memory. This layout allows the control unit to manipulate large quantities of data in parallel at high speed. The 'DRAM' in both cases is not located on chip, requires more clock cycles to access, and is called global memory.

The GPU memory schematic is included in Figure 6. For threads residing on the same block there is a high degree of data visibility; that is, threads in the same block can communicate rapidly. This visibility is mediated through the shared memory and registers located within the block, of which there are 16 KB of shared memory and 16 KB of register space per block. If the required data does not fit into these confines, or the sizes of certain arrays are not defined at compile time, memory overflows and the compiler starts allocating the data to the address space in local memory. Local memory, in terms of the physical layout of the graphics chip, is equivalent to global memory, and accessing data stored in global memory incurs a relatively high level of latency. A large part of the coding process should therefore be concerned with designing the algorithm to avoid overflow into local memory. The constant memory and texture memory spaces (fig. 6) are both cached and provide alternatives to global/local memory storage.

Data stored within the cache memory space is more readily available. A cached access takes as long as a global/local memory access the first time, but subsequent reads can be as fast as on-chip shared memory since the data has already been retrieved and remains in the cache. On-chip manipulations (those residing within register and shared memory space) cost about 4-6 clock cycles per instruction; accessing data in local memory requires 400-600 clock cycles. A two order of magnitude performance degradation is generally unacceptable, so the algorithm must be carefully arranged to keep as much data on chip as possible and to ensure that any data which must be read from or written to global memory is coalesced. Coalesced accesses occur when the memory addresses accessed by a warp are sequentially ordered and aligned in multiples of 32 (the warp size), as the pair of kernels below illustrates.
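The difference is visible even in a trivial copy kernel: in the first version below, consecutive threads in a warp touch consecutive addresses (coalesced); in the second they stride through memory, so each warp's access is split into many transactions. This is a generic illustration, not project code.

    /* Coalesced vs. uncoalesced global memory access (illustration). */
    __global__ void copy_coalesced(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];     /* thread k of a warp hits address k:
                                   one transaction per warp */
    }

    /* Arrays must hold n * s elements for this strided variant. */
    __global__ void copy_strided(const double *in, double *out, int n, int s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i * s] = in[i * s];  /* adjacent threads s elements apart:
                                        many transactions per warp */
    }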

Figure 7 captures the essence of the GPU programming model. The algorithm is partitioned into a kernel grid: the kernel is the function launched on the GPU, and it is executed by a grid of blocks. The SIMD model mandates that each block be a self-contained set of threads executing the kernel algorithm. The classic example is matrix multiplication, in which each block loads a square tile of the matrices into on-chip shared memory; after the tiles have been loaded, each block loops over the rows and columns of its data subset, and when it has finished its partial summation it writes its tile of the product matrix back to global memory. A sketch of this tiled scheme follows. The GPU code used for this project has been included in Appendix B and follows a similar algorithmic route to the matrix multiplication: work is partitioned between blocks and finally written back to global memory. The attached code may give a better understanding of what is meant by blocks and threads and how these organizing principles can be used to write parallel code for the GPU.
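A sketch of the textbook tiled scheme (generic CUDA, not the Appendix B kernel; for brevity it assumes n is a multiple of TILE):

    /* Classic shared-memory tiled matrix multiply, C = A * B for n x n
       matrices.  Each block computes one TILE x TILE square of C; each
       thread computes one element. */
    #define TILE 16

    __global__ void matmul(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];   /* on-chip tiles of A and B */
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();               /* tile fully loaded */
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               /* done with this tile */
        }
        C[row * n + col] = acc;            /* coalesced write to global */
    }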

Figure 5: A more detailed depiction of the GPU device architecture, showing the passage of data and the instruction set from the host CPU to the GPU and into the global memory space. This figure also displays the partitioned layout of the GPU for fast processing of data and the two-way load/store communication between the on-chip GPU shared memory space and the off-chip, slow access global memory space.

Figure 6: The GPU memory model. This demonstrates the layout of a block partitioned into computational threads, each executing the same set of instructions, and the registers, shared memory, and global memory which each thread can access to read/write data. [6]

Figure 7: Relationship between streaming multi-processors and blocks. Each of the Tesla's 30 streaming multi-processors is composed of 8 scalar processing cores which can each execute a single block at a time. Each streaming multi-processor can execute up to 32 threads of a block at once: this is called a warp. The instructions are defined in the kernel and are the same for each thread, in keeping with the Single Instruction Multiple Data model of parallelism. As long as there are more blocks than streaming multi-processors, performance will scale with the number of streaming multiprocessors.

5 Method: ERI Implementation on GPU

A typical CPU implementation of the electron repulsion integrals proceeds serially, with four outer loops sequencing through all the unique combinations of electron orbitals represented as Gaussians and four inner loops sequencing through the shell primitive Gaussians that compose each contracted Gaussian function. The GPU algorithm is significantly different. The following steps show the sequence of computations necessary to compute the ERIs on the GPU and highlight where control passes from the CPU to the GPU:

1. Calculate pre-factors on the CPU (equations 24 to 26).  [on CPU]
2. Copy the pre-factors onto the GPU.  [CPU to GPU]
3. Each block loads the appropriate pre-factors into shared memory.  [on GPU]
4. Each block executes the kernel to calculate a unique ERI (cf. 3.0.3).  [on GPU]
5. Integrals are translated from Cartesian co-ordinates to spherical harmonics.  [on GPU]
6. The calculated ERI values are transferred back to the CPU.  [GPU to CPU]

Porting this computation to the GPU requires a scheme which maps the CPU computation onto the GPU in a way that makes effective use of the GPU's resources. The kernel runs a modified McMurchie-Davidson algorithm: the classical McMurchie-Davidson algorithm together with a scheme dictating how to distribute the individual ERIs efficiently between the different blocks on the GPU. The mapping used in this project is the same as one used by Ufimtsev and Martinez[3], who present three schemes: One-Thread One-Contracted integral, One-Block One-Contracted integral, and One-Thread One-Primitive integral. All three mapping techniques are illustrated in figure 8.

The One-Block One-Contracted integral scheme, shown schematically in Figure 8, is the one implemented in this work. It has two advantages: it possesses a medium level of parallel granularity, meaning the work is divided over a relatively large number of threads and blocks, and it is the most convenient scheme for porting the current design of the existing EXCITON code to the GPU. A finer grain of parallelization means more threads are in active execution. The One-Block One-Contracted mapping states that each block calculates the coulombic repulsion between up to four electronic charge distributions represented as contracted Gaussians (equation 14). The longer the contraction length, the better this algorithm should perform: each thread calculates the electronic repulsion for one primitive integral, with a final sum reduction on the block reassembling the primitive integrals into a contracted integral, as in the skeleton below. If more primitive Gaussians make up the contraction, more threads are actively executing on each block and a higher level of parallelism is achieved.
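A skeleton of this mapping is sketched below. The real kernel in Appendix B additionally handles angular momentum, the pre-factors, and the spherical-harmonic transform; the helper names and the 256-thread, power-of-two launch assumed here are illustrative.

    /* Skeleton of the One-Block One-Contracted mapping: block b computes
       contracted ERI b; thread t computes one primitive ERI; a shared-
       memory tree reduction reassembles the contraction.  Launch with a
       power-of-two blockDim.x <= 256. */
    __device__ double primitive_eri(int eri_id, int prim_id);  /* assumed */

    __global__ void eri_block_kernel(double *eri_out, int n_prim)
    {
        __shared__ double partial[256];        /* one slot per thread */
        int b = blockIdx.x;                    /* this block's ERI */
        int t = threadIdx.x;                   /* this thread's primitive */
        partial[t] = (t < n_prim) ? primitive_eri(b, t) : 0.0;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  /* tree reduction */
            if (t < s) partial[t] += partial[t + s];
            __syncthreads();
        }
        if (t == 0) eri_out[b] = partial[0];   /* contracted integral */
    }

Threads with t >= n_prim contribute nothing; these are the idle threads referred to in the Figure 8 caption below.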

Figure 8: Different mappings of electron repulsion integrals to the device[3]. The One-Block One-Contracted integral scheme is shown top left. Each block on the kernel grid calculates a single contracted ERI, and each thread of the block calculates one of the primitive integrals in the contraction. At the end of the algorithm there is a sum reduction to re-form the contracted integral. The number of idle threads in this scheme depends on the number of primitive integrals in the contraction.

The suggested mapping procedure starts by laying out all N atomic orbitals ψ_i, defined as Gaussian functions (equation 15), as a square matrix whose elements are the bra and ket arrays ψ_iψ_j. This square matrix has side length M = N(N + 1)/2. For example, the helium atom with a basis set of N = 3 requires a grid of size M² = 36. Due to the (bra|ket) = (ket|bra) symmetry of the coulomb repulsion integral, only twenty-one of these integrals (the diagonal and upper triangular part of the grid) are unique; the remaining integrals below the diagonal are redundant. Because of the layout of the GPU and its intrinsic parallelism, minimal performance degradation is expected from including the redundant integrals. However, future implementations might seek to eliminate them so that all the processors generate relevant information.
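For book-keeping, the mapping from an orbital pair (i, j) with i <= j to a row of this bra/ket grid can be sketched as below; this is a hypothetical helper, and EXCITON's actual indexing may differ.

    /* Pair indexing for the M = N(N+1)/2 bra/ket grid (illustrative).
       Enumerates the pairs (i, j) with 0 <= i <= j < N in row order. */
    int pair_index(int i, int j, int N)
    {
        return i * N - i * (i - 1) / 2 + (j - i);
    }
    /* Example: N = 3 gives M = 6 pairs, hence a 6 x 6 = 36 grid of ERIs,
       of which only the 6*7/2 = 21 on or above the diagonal are unique. */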

The first part of the computation, which includes generating the pairs of two-electron repulsion integrals to be calculated and the recursively calculated pre-factors (section 3.0.3), takes a small fraction of the total time: roughly one part in a thousand. These terms are therefore generated on the CPU and transferred to the GPU, where the highly multi-threaded processors perform the remaining manipulations of the data (detailed in the steps above) and the final contractions for all the electron repulsion integrals. Once this computation is complete, the data is transferred back to the CPU and the rest of the EXCITON self-consistent field program[11] can run.

6 Results and Discussion

The CUDA visual profiler[10] and predefined device functions allow the user to time different segments of the computation and to track the time spent on memory transfers from host to device, device to device, and device to host. It also times each kernel call (the time required by the GPU to calculate the ERIs). MPI (Message Passing Interface) contains a built-in timing facility which was used to time the same parts of the computation in the CPU implementation. Figure 9 compares the times for the GPU and CPU implementations of EXCITON; an event-based alternative for collecting the kernel timings is sketched below.
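Equivalent kernel timings can also be collected directly with CUDA events, along the lines of this generic sketch (not the measurement code actually used):

    /* Generic CUDA-event timing of a kernel call (sketch). */
    float time_kernel_ms(void (*launch)(void))
    {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        launch();                       /* e.g. the ERI kernel launch */
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);     /* wait for the kernel to finish */
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }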

Figure 9 displays the time in milliseconds to calculate all the electron repulsion integrals for the He 1s², Be 1s²2s², C 1s²2s²2p², and Ne 1s²2s²2p⁶ systems. Each system is modeled using three Gaussian functions. When electron repulsion integrals which are zero by symmetry are omitted, the first three atoms have a total of 21 unique integrals to calculate, and Ne has a total of 31. For the smaller systems the CPU implementation outperforms the GPU. This is likely a result of the small numerical load of the computation: a highly optimized compiler can fit all the data in the CPU cache, so the computation proceeds rapidly and the GPU's many-core advantage is negligible. The neon system is the first to show a noticeable improvement, with a factor of two performance increase.

In benchmarking the speed of different algorithms on different architectures there are a number of considerations. While massive speed-ups have been obtained using GPUs, it is important that the benchmarking is fair and that the amount of work that has gone into the GPU implementation is comparable to that behind the CPU implementation. In this case the comparison is fair. The EXCITON software package has been carefully considered and optimized over the last few years, and the CPU code was built with the latest gcc 4.4.0 compiler at the highest level of optimization to ensure the compiler was doing everything possible to increase performance. The GPU algorithm was likewise compiled with the highest optimization flags.

As has been stated, GPUs are meant to outperform CPUs only for data-parallel, numerically intensive computations. While the properties of small atomic systems can be calculated in a data-parallel fashion, they are not numerically intensive and are unlikely to reveal the computing power of the GPU; larger multi-atom systems would better demonstrate speed-ups. In the current trials only a very small fraction of the GPU resources is being used.

Figure 9: Comparison of the CPU ERI implementation vs the GPU implementation of the EXCITON code. Displayed above are the times for calculating the electron repulsion integrals in four atomic systems. The improvement in timing on the GPU increases with the system size. For neon there is a factor of two performance increase with the GPU implementation. Speed-ups are expected to improve as larger systems are considered.


The small atomic systems considered so far run very quickly on a CPU. The fact that any speed-ups at all have been detected for small systems is promising, and suggests that as the code is extended to larger many-atom systems the relative speed-up will be even greater. Moreover, the One-Block One-Contracted integral mapping scheme is meant to perform better for highly contracted Gaussian integrals, that is, Gaussians with a large number of primitive integrals. At the time of writing, the basis sets in use consist of Gaussian functions with only a single primitive integral; when larger basis sets are used there should be an additional improvement over the CPU.

Figure 10: Comparison of kernel execution time for the helium atom and the beryllium atom. Integrals 2e kernel is the name of the process for calculating the ERIs; the memcopy functions are the transfer of the pre-factors onto the device and of the calculated integrals from the GPU back onto the CPU. The length of each bar indicates the time required by that process. The graph demonstrates that while the actual computation of the integrals is comparable on the two systems, the time taken to transfer the necessary data onto the GPU is the limiting factor and exceeds that of the actual computation. The transfer time will become less significant with increased system size and overlapped (asynchronous) memory transfers.

In Fig. 10 the CUDA visual profiler has been used to return information about the time required for the different phases of the GPU algorithm: the transfer of data onto and off of the device, and the ERI algorithm itself. These reveal some points of interest about the One-Block One-Contracted integral mapping. First, Fig. 10 suggests that the algorithm is bound by the number of Gaussian functions in the basis set rather than by the total number of electrons in the system. Beryllium has two more electrons than helium, and because of the O(N^4) scaling of the number of ERIs, the beryllium system might be expected to take fractionally longer. However, since on the GPU all these integrals are computed in parallel, there is no appreciable difference in the time taken to calculate the electron repulsion integrals for the two systems. This performance advantage of the GPU should continue all the way up to the number of processing cores available on the GPU (figure 7); in the case of the Tesla C1060 that number is 240. Extrapolating all the way to this number suggests a maximum theoretical speed-up on the GPU over a single core CPU in the area of 200x. Testing this would require full streaming multiprocessor occupancy and large molecular systems.

Fig. 10 also demonstrates one of the limitations of the proposed scheme. While the actual calculation of the electron repulsion integrals is comparable or faster in the current GPU implementation, transferring the necessary pre-factors into the GPU's memory space, and the results back into the CPU's memory space when the calculations are complete, takes considerable time relative to the computation itself. For the small atomic systems this is a prohibitive bottleneck: the speed-up on the GPU does not compensate for the additional time required to move data between the host CPU and the GPU. However, this is only the case for small systems. The data transfer time becomes proportionally less significant as the number of electron repulsion integrals grows with the molecular system under consideration. Furthermore, in recent NVIDIA releases of the CUDA toolkit it is possible to write code which transfers data onto and off of the GPU while the GPU is running calculations (asynchronous transfers over CUDA streams). On large molecular systems the transfer time could be completely hidden by calculating the electron repulsion integrals on the device while simultaneously streaming the pre-calculated factors for the next batch of integrals into the device's global memory, as sketched below.
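Sketched with the current API, such overlap uses page-locked host buffers (allocated with cudaMallocHost) and two CUDA streams; the batch layout, launch configuration, and kernel name here are illustrative assumptions.

    /* Sketch: overlap the pre-factor upload for batch k with the ERI
       kernel for batch k-1, using two CUDA streams. */
    #define GRID    64        /* illustrative launch configuration */
    #define THREADS 256

    __global__ void eri_kernel(double *pre);   /* assumed, cf. Appendix B */

    void run_batches(double *h_pre[], double *d_pre[], size_t bytes,
                     int n_batches)             /* h_pre[]: page-locked */
    {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);
        for (int k = 0; k < n_batches; ++k) {
            cudaStream_t cur = s[k % 2];
            /* copy batch k's pre-factors while the other stream computes */
            cudaMemcpyAsync(d_pre[k], h_pre[k], bytes,
                            cudaMemcpyHostToDevice, cur);
            eri_kernel<<<GRID, THREADS, 0, cur>>>(d_pre[k]);
        }
        cudaDeviceSynchronize();    /* wait for all batches to finish */
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }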

A second anticipated problem with the chosen scheme is that ERIs involving higher angular momentum wave functions require much more time than those involving lower angular momentum wave functions. This means that at any one time a number of processors whose blocks have finished their work will sit idle while other blocks, with higher angular momentum wave functions, finish their computational cycle. Since the next batch of integrals cannot be loaded until the entire first batch is finished, this could cause some performance degradation.

7 Conclusion

As has been discussed, the unique memory model and the number of processing cores per GPU mean that the computing potential is available to make significant improvements in the time requirements of many tasks. However, harnessing this power can be difficult. Development using the CUDA programming language and the architecture of the GPU can prove recalcitrant at times. Moreover, parallel programming introduces a number of complications not present when writing serial code. For instance, a serial code will fail in a predictable and consistent way, whereas parallel code may work nine times out of ten but fail on the tenth run. Such failures are often due to the effectively random order in which different blocks are executed, which can have unpredictable consequences. However, as the coding of the GPU implementation progressed, the GPU framework became much easier to understand and development made more significant progress.

The field of CUDA and GPU programming is also very young. The Tesla C1060 device used in this project is the first NVIDIA GPU that supports double precision computation, so there is relatively little established literature and reference material to aid a new programmer. Even over the two months of the project, new and important software for debugging applications and for visualizing how data is distributed in the GPU memory space became available, greatly simplifying error correction. Nor is the CUDA programming language by any means static. The developers at NVIDIA and individual groups are ensuring that programming languages for compute applications on GPUs are constantly evolving. Python, a natural and popular language for numerical computing, is being brought to GPU devices under the name PyCUDA[8]. There is also movement towards an open source standardized language for computing on GPUs, being developed under the name Open Computing Language (OpenCL), analogous to the OpenGL graphics language already in use for the GPU's native graphical capabilities. OpenCL is less advanced than the CUDA framework at the current time but is attracting a great deal of interest. The rapidly evolving nature of this field means that even over a time scale of two months, advances are being made that ease the transport of CPU code onto the GPU and help exploit the full processing power of the GPU.

As a final note on the directions opened up by GPU architectures: Ufimtsev and Martinez have developed programs that calculate the ERIs and the exchange integrals entirely on the GPU, along with the rest of a self-consistent field algorithm [5]. This has allowed them to perform ab initio molecular dynamics simulations of meaningful many-body molecular systems over practical time scales. The ability to run ab initio molecular dynamics on what is essentially a desktop computing machine is a non-trivial result. Linear algebra packages on the GPU are becoming highly optimized [16]; using them to carry the self-consistent field algorithm on the device, and writing code that exploits the GPU's native ability to display graphics, could speed up how quickly the EXCITON software package runs its own self-consistent field programs and generates Fermi surfaces and band structures of materials.
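As a sketch of one such step, assuming illustrative matrix names and the CUBLAS library bundled with the CUDA toolkit, the closed-shell density matrix D = 2 C_occ C_occ^T could be formed on the device with a single DGEMM call:

// Sketch: forming the closed-shell density matrix D = 2 * C_occ * C_occ^T
// with CUBLAS, as one step of keeping the SCF cycle on the device.
// n = number of basis functions, nocc = occupied orbitals; h_C holds the
// occupied MO coefficients column-wise (illustrative names).
#include <cublas.h>

void density_on_gpu(const double *h_C, double *h_D, int n, int nocc)
{
    double *d_C, *d_D;
    cublasInit();
    cublasAlloc(n * nocc, sizeof(double), (void **)&d_C);
    cublasAlloc(n * n, sizeof(double), (void **)&d_D);
    cublasSetMatrix(n, nocc, sizeof(double), h_C, n, d_C, n);
    /* D = 2 C C^T: one dense DGEMM, the operation tuned to near-peak
       throughput on this hardware by Volkov and Demmel [16] */
    cublasDgemm('N', 'T', n, n, nocc, 2.0, d_C, n, d_C, n, 0.0, d_D, n);
    cublasGetMatrix(n, n, sizeof(double), d_D, n, h_D, n);
    cublasFree(d_C); cublasFree(d_D);
    cublasShutdown();
}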

As users gain maturity and experience with GPUs, as the compiler and debugging software improve, and as the programming language becomes standardized, the processing power of the GPU will become essential for applications in computational materials science, providing both fast data processing and sophisticated visualization techniques.

At the completion of the project, a factor of two speed-up for calculating the electron repulsion integrals was achieved. This is a preliminary result. A substantial amount of coding and optimization remains to be done, and a more conclusive result would require running the program on larger, more complex molecular systems with more highly contracted basis sets, where more impressive speed-ups should appear. Obtaining the two orders of magnitude performance increases observed elsewhere [3] will rely on optimizing the existing GPU code and applying it to larger molecular systems. Given the increasing sophistication of GPU technology and software, and the promising speed-ups observed even on the exceedingly small atomic systems considered here, there is hope that with more time the GPU implementation of EXCITON, in essentially its present form, will significantly outperform the CPU implementation.

8 Acknowledgement

I’d like to acknowledge Dr. Charles Patterson for coding the EXCITON software package,

and for his useful advice and instruction.


A NVIDIA Hardware Specifications

Device 0: "Tesla C1060"

CUDA Driver Version: 2.30

CUDA Runtime Version: 2.30

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 4294705152 bytes

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.30 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default

Bandwidth Transfer: Tesla C1060 (Quick Mode)

Direction                     Transfer Size (Bytes)   Bandwidth (MB/s)
Host to Device (pageable)     33554432                2959.5
Device to Host (pageable)     33554432                2582.1
Device to Device              33554432                73355.5
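At the measured host-to-device rate of roughly 3 GB/s, moving the 32 MB test buffer takes about 11 ms each way, whereas on-device traffic runs some 25 times faster; this disparity underlies the transfer bottleneck discussed in Section 6.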


B GPU Kernel Code

//ONE BLOCK ONE CONTRACTED INTEGRAL SCHEME (handles up to P Orbitals)

#include <stdio.h>

#include <cutil.h>

#include "ATOM_SCF_kernel.h"

#include "USER_DATA.h"

#include "myconstants.h"

__device__ void f000m_device(int , double*, double , double , int, double*);

__global__ void integrals_2e_kernel(DEVICE *device, INTEGRAL_LIST *integral_list,

SHELL *shells, double *d_F, double *d_fgtuvfinal)

{

VECTOR_DOUBLE r_12;

__shared__ double *p_F_temp;

__shared__ double *p_F_sh;

__shared__ double F_temp[784];

__shared__ double F_sh[225];

__shared__ double c1fac[3][3][3];

__shared__ double s_sab[1];

__shared__ double s_scd[1];

__shared__ double s_pab[1];

__shared__ double s_pcd[1];

__shared__ double s_fac1[1];

__shared__ double s_pinv[1];

__shared__ double s_c1x[12];

__shared__ double s_c2x[12];

__shared__ double f[5][5][5][5];

__shared__ double en[1][55];

__shared__ int tpupvp2, tpupvpsign;

__shared__ int tuv[8][6][3];

__shared__ double s_fgtuv[125];

int i1, j1, k1, l1;

int i, j, k, l, n;

double seven = 7.0000000;

double five = 5.0000000;

int count = 0;

int counter1, counter2;

int n1, n2, n3, n4, n5, n6;

int n7, n8, n9, n10, n11, n12;

int dime1, dime2, dime3, dime4;

int lim_ij, lim_kl;

int slim_ij, slim_jk, slim_kl;

int sheli, shelj, shelk, shell;

int sheli1, shelj1, shelk1, shell1;

int nsheli, nshelj, nshelk, nshell;

int op, oppshift1, oppshift2, oppshift3, oppshift4;

int *p_i, *p_j, *p_k, *p_l, *p_i1, *p_j1, *p_k1, *p_l1;

int index, index_i, index_j, index_k, index_l, i4, j4, k4, l4;

int sheli_lim, shelj_lim, shelk_lim, shell_lim;

int t, u, v, m;

int tp, up, vp;

int tmax, umax, vmax;

int tpmax, upmax, vpmax;

double *p_rot1, *p_rot2, *p_rot3, *p_rot4;

double tmp;

double fac, fac0, fac1, fac2, fac3;

int bfposi, bfposj, bfposk, bfposl;


int bfposi1, bfposj1, bfposk1, bfposl1;

int gausposi, gausposj, gausposk, gausposl;

int imax, kmax, lmax, jmax;

int mm = 0;

bfposk1 = 0; bfposl1 = 0; bfposi1 = 0; bfposj1 = 0;

for (t = 0; t< 3; t++)

for (u = 0; u < 3; u++)

for (v = 0; v < 3; v ++)

c1fac[t][u][v] = 0;

for(i = 0; i < 5; i++)

for(j = 0; j < 5; j++)

for(n = 0; n < 5; n++)

for(t = 0; t < 5; t++)

f[i][j][n][t] = 0;

for (n = 0; n < 55; n ++) en[0][n] = 0.0;
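// Map the 2-D block index onto shell-type pairs: blockIdx.y in 0..5 selects
// one of the six unique bra combinations (index_i <= index_j) and blockIdx.x
// the corresponding ket combination (index_k, index_l).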

if (blockIdx.y == 0 || blockIdx.y == 1 || blockIdx.y == 2) index_i = 0;

if (blockIdx.y == 3 || blockIdx.y == 4) index_i = 1;

if (blockIdx.y == 1 || blockIdx.y == 3) index_j = 1;

if (blockIdx.y == 0) index_j = 0;

if (blockIdx.y == 2) index_j = 2;

if (blockIdx.y == 4) index_j = 2;

if (blockIdx.y == 5) {

index_i = 2;

index_j = 2;

}

if (blockIdx.x == 0) index_l = 0;

if (blockIdx.x == 0 || blockIdx.x == 1 || blockIdx.x == 2) index_k = 0;

if (blockIdx.x == 3 || blockIdx.x == 4) index_k = 1;

if (blockIdx.x == 1 || blockIdx.x == 3) index_l = 1;

if (blockIdx.x == 2 || blockIdx.x == 4) index_l = 2;

if (blockIdx.x == 5) {

index_k = 2;

index_l = 2;

}

sheli = shells->type1_sh[index_i];

sheli1 = shells->type_sh[index_i];

imax = shells->imax_sh[index_i];

sheli_lim = sheli;

for (index = 0; index < index_i; index++) {

bfposj += shells->type1_sh[index_i - index - 1];

bfposj1 += shells->type_sh[index_i - index - 1];

gausposj += shells->ng_sh[index_i - index - 1];

}

shelj = shells->type1_sh[index_j];

shelj1 = shells->type_sh[index_j];

jmax = shells->imax_sh[index_j];

shelj_lim = shelj;

shelk = shells->type1_sh[index_k];

shelk1 = shells->type_sh[index_k];

kmax = shells->imax_sh[index_k];

shelk_lim = shelk;

shell = shells->type1_sh[index_l];

shell1 = shells->type_sh[index_l];

lmax = shells->imax_sh[index_l];

shell_lim = shell;

int counter;

for (counter = 0; counter < 1; counter++)

if ((index_k == index_i && index_l < index_j) ||

((((imax + jmax) / 2) * 2 == imax + jmax) && (((kmax + lmax + 1) / 2) * 2 == kmax + lmax + 1)) ||

((((imax + jmax + 1) / 2) * 2 == imax + jmax + 1) && (((kmax + lmax) / 2) * 2 == kmax + lmax))) continue;

if(blockIdx.y > blockIdx.x) continue;


mm = imax + jmax + kmax + lmax;

r_12.comp1 = 1.0;

r_12.comp2 = 1.0;

r_12.comp3 = 1.0;

s_sab[0] = device->sab[blockIdx.y + threadIdx.x];

s_pab[0] = device->pab[blockIdx.y + threadIdx.x];

s_scd[0] = device->scd[blockIdx.x + threadIdx.y];

s_pcd[0] = device->pcd[blockIdx.x + threadIdx.y];

s_fac1[0] = s_sab[0] * s_scd[0];

s_pinv[0] = (1.0 / s_pab[0]) + (1.0 / s_pcd[0]);

f000m_device(25, &en[0][0], 0.0, (1.0/ (*s_pinv)), mm, d_F);
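// Build the Hermite auxiliary integrals R_{t,u,v}^(n) from the Boys
// function values returned by f000m_device, using the McMurchie-Davidson
// recursion, e.g. R_{t+1,u,v}^(n) = X12 * R_{t,u,v}^(n+1) + t * R_{t-1,u,v}^(n+1).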

for (n = 0; n <= mm; n++)

f[0][0][0][n] = en[0][n];

for (n = 0; n <= mm; n++)

f[1][0][0][n] = r_12.comp1 * en[0][n + 1];

for (n = 0; n <= mm; n++)

f[0][1][0][n] = r_12.comp2 * en[0][n + 1];

for (n = 0; n <= mm; n ++)

f[0][0][1][n] = r_12.comp3 * en[0][n + 1];

for (n = 0; n <= mm - 1; n++)

f[1][1][0][n] = r_12.comp1 * f[0][1][0][n + 1];

for (n = 0; n <= mm - 1; n++)

f[1][0][1][n] = r_12.comp1 * f[0][0][1][n + 1];

for (n = 0; n <= mm - 1; n++)

f[0][1][1][n] = r_12.comp2 * f[0][0][1][n + 1];

for (n = 0; n <= mm - 1; n++)

f[2][0][0][n] = r_12.comp1 * f[1][0][0][n + 1] + f[0][0][0][n + 1];

for (n = 0; n <= mm - 1; n++)

f[0][2][0][n] = r_12.comp2 * f[0][1][0][n + 1] + f[0][0][0][n + 1];

for (n = 0; n <= mm - 1; n++)

f[0][0][2][n] = r_12.comp3 * f[0][0][1][n + 1] + f[0][0][0][n + 1];

for (n = 0; n <= mm - 2; n++)

f[2][1][0][n] = r_12.comp1 * f[1][1][0][n + 1] + f[0][1][0][n + 1];

for (n = 0; n <= mm - 2; n++)

f[2][0][1][n] = r_12.comp1 * f[1][0][1][n + 1] + f[0][0][1][n + 1];

for (n = 0; n <= mm - 2; n++)

f[1][2][0][n] = r_12.comp2 * f[1][1][0][n + 1] + f[1][0][0][n + 1];

for (n = 0; n <= mm - 2; n++)

f[0][2][1][n] = r_12.comp2 * f[0][1][1][n + 1] + f[0][0][1][n + 1];

for (n = 0; n <= mm - 2; n++)

f[0][1][2][n] = r_12.comp3 * f[0][1][1][n + 1] + f[0][1][0][n + 1];

for (n = 0; n <= mm - 2; n++)

f[1][0][2][n] = r_12.comp3 * f[1][0][1][n + 1] + f[1][0][0][n + 1];

for (n = 0; n <= mm - 2; n++)

f[1][1][1][n] = r_12.comp1 * f[0][1][1][n + 1];

for (n = 0; n <= mm - 2; n++)

f[3][0][0][n] = r_12.comp1 * f[2][0][0][n + 1] + two * f[1][0][0][n + 1];

for (n = 0; n <= mm - 2; n++)

f[0][3][0][n] = r_12.comp2 * f[0][2][0][n + 1] + two * f[0][1][0][n + 1];

for (n = 0; n <= mm - 2; n++)

f[0][0][3][n] = r_12.comp3 * f[0][0][2][n + 1] + two * f[0][0][1][n + 1];

for (n = 0; n <= mm - 3; n++)

f[2][1][1][n] = r_12.comp1 * f[1][1][1][n + 1] + f[0][1][1][n + 1];

for (n = 0; n <= mm - 3; n++)

f[1][2][1][n] = r_12.comp2 * f[1][1][1][n + 1] + f[1][0][1][n + 1];


for (n = 0; n <= mm - 3; n++)

f[1][1][2][n] = r_12.comp3 * f[1][1][1][n + 1] + f[1][1][0][n + 1];

for (n = 0; n <= mm - 3; n++)

f[2][2][0][n] = r_12.comp1 * f[1][2][0][n + 1] + f[0][2][0][n + 1];

for (n = 0; n <= mm - 3; n++)

f[2][0][2][n] = r_12.comp1 * f[1][0][2][n + 1] + f[0][0][2][n + 1];

for (n = 0; n <= mm - 3; n++)

f[0][2][2][n] = r_12.comp2 * f[0][1][2][n + 1] + f[0][0][2][n + 1];

for (n = 0; n <= mm - 3; n++)

f[3][1][0][n] = r_12.comp1 * f[2][1][0][n + 1] + two * f[1][1][0][n + 1];

for (n = 0; n <= mm - 3; n++)

f[3][0][1][n] = r_12.comp1 * f[2][0][1][n + 1] + two * f[1][0][1][n + 1];

for (n = 0; n <= mm - 3; n++)

f[0][3][1][n] = r_12.comp2 * f[0][2][1][n + 1] + two * f[0][1][1][n + 1];

for (n = 0; n <= mm - 3; n++)

f[1][3][0][n] = r_12.comp2 * f[1][2][0][n + 1] + two * f[1][1][0][n + 1];

for (n = 0; n <= mm - 3; n++)

f[1][0][3][n] = r_12.comp3 * f[1][0][2][n + 1] + two * f[1][0][1][n + 1];

for (n = 0; n <= mm - 3; n++)

f[0][1][3][n] = r_12.comp3 * f[0][1][2][n + 1] + two * f[0][1][1][n + 1];

for (n = 0; n <= mm - 3; n++)

f[4][0][0][n] = r_12.comp1 * f[3][0][0][n + 1] + three * f[2][0][0][n + 1];

for (n = 0; n <= mm - 3; n++)

f[0][4][0][n] = r_12.comp2 * f[0][3][0][n + 1] + three * f[0][2][0][n + 1];

for (n = 0; n <= mm - 3; n++)

f[0][0][4][n] = r_12.comp3 * f[0][0][3][n + 1] + three * f[0][0][2][n + 1];

for (n = 0; n <= mm - 4; n++)

f[2][2][1][n] = r_12.comp1 * f[1][2][1][n + 1] + f[0][2][1][n + 1];

for (n = 0; n <= mm - 4; n++)

f[2][1][2][n] = r_12.comp1 * f[1][1][2][n + 1] + f[0][1][2][n + 1];

for (n = 0; n <= mm - 4; n++)

f[1][2][2][n] = r_12.comp2 * f[1][1][2][n + 1] + f[1][0][2][n + 1];

for (n = 0; n <= mm - 4; n++)

f[3][2][0][n] = r_12.comp1 * f[2][2][0][n + 1] + two * f[1][2][0][n + 1];

for (n = 0; n <= mm - 4; n++)

f[3][0][2][n] = r_12.comp1 * f[2][0][2][n + 1] + two * f[1][0][2][n + 1];

for (n = 0; n <= mm - 4; n++)

f[0][3][2][n] = r_12.comp2 * f[0][2][2][n + 1] + two * f[0][1][2][n + 1];

for (n = 0; n <= mm - 4; n++)

f[2][3][0][n] = r_12.comp2 * f[2][2][0][n + 1] + two * f[2][1][0][n + 1];

for (n = 0; n <= mm - 4; n++)

f[2][0][3][n] = r_12.comp3 * f[2][0][2][n + 1] + two * f[2][0][1][n + 1];

for (n = 0; n <= mm - 4; n++)

f[0][2][3][n] = r_12.comp3 * f[0][2][2][n + 1] + two * f[0][2][1][n + 1];

for (n = 0; n <= mm - 4; n++)

f[4][1][0][n] = r_12.comp1 * f[3][1][0][n + 1] + three * f[2][1][0][n + 1];

for (n = 0; n <= mm - 4; n++)

f[4][0][1][n] = r_12.comp1 * f[3][0][1][n + 1] + three * f[2][0][1][n + 1];

for (n = 0; n <= mm - 4; n++)

f[0][4][1][n] = r_12.comp2 * f[0][3][1][n + 1] + three * f[0][2][1][n + 1];

for (n = 0; n <= mm - 4; n++)

f[1][4][0][n] = r_12.comp2 * f[1][3][0][n + 1] + three * f[1][2][0][n + 1];

for (n = 0; n <= mm - 4; n++)

f[0][1][4][n] = r_12.comp3 * f[0][1][3][n + 1] + three * f[0][1][2][n + 1];

for (n = 0; n <= mm - 4; n++)

f[1][0][4][n] = r_12.comp3 * f[1][0][3][n + 1] + three * f[1][0][2][n + 1];

for (n = 0; n <= mm - 4; n++)

f[5][0][0][n] = r_12.comp1 * f[4][0][0][n + 1] + four * f[3][0][0][n + 1];


for (n = 0; n <= mm - 4; n++)

f[0][5][0][n] = r_12.comp2 * f[0][4][0][n + 1] + four * f[0][3][0][n + 1];

for (n = 0; n <= mm - 4; n++)

f[0][0][5][n] = r_12.comp3 * f[0][0][4][n + 1] + four * f[0][0][3][n + 1];

for (n = 0; n <= mm - 5; n++)

f[2][2][2][n] = r_12.comp1 * f[1][2][2][n + 1] + f[0][2][2][n + 1];

for (n = 0; n <= mm - 5; n++)

f[3][2][1][n] = r_12.comp1 * f[2][2][1][n + 1] + two * f[1][2][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[3][1][2][n] = r_12.comp1 * f[2][1][2][n + 1] + two * f[1][1][2][n + 1];

for (n = 0; n <= mm - 5; n++)

f[1][3][2][n] = r_12.comp2 * f[1][2][2][n + 1] + two * f[1][1][2][n + 1];

for (n = 0; n <= mm - 5; n++)

f[2][3][1][n] = r_12.comp2 * f[2][2][1][n + 1] + two * f[2][1][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[2][1][3][n] = r_12.comp3 * f[2][1][2][n + 1] + two * f[2][1][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[1][2][3][n] = r_12.comp3 * f[1][2][2][n + 1] + two * f[1][2][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[4][1][1][n] = r_12.comp1 * f[3][1][1][n + 1] + three * f[2][1][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[1][4][1][n] = r_12.comp2 * f[1][3][1][n + 1] + three * f[1][2][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[1][1][4][n] = r_12.comp3 * f[1][1][3][n + 1] + three * f[1][1][2][n + 1];

for (n = 0; n <= mm - 5; n++)

f[5][1][0][n] = r_12.comp1 * f[4][1][0][n + 1] + four * f[3][1][0][n + 1];

for (n = 0; n <= mm - 5; n++)

f[5][0][1][n] = r_12.comp1 * f[4][0][1][n + 1] + four * f[3][0][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[1][5][0][n] = r_12.comp2 * f[1][4][0][n + 1] + four * f[1][3][0][n + 1];

for (n = 0; n <= mm - 5; n++)

f[0][5][1][n] = r_12.comp2 * f[0][4][1][n + 1] + four * f[0][3][1][n + 1];

for (n = 0; n <= mm - 5; n++)

f[0][1][5][n] = r_12.comp3 * f[0][1][4][n + 1] + four * f[0][1][3][n + 1];

for (n = 0; n <= mm - 5; n++)

f[1][0][5][n] = r_12.comp3 * f[1][0][4][n + 1] + four * f[1][0][3][n + 1];

for (n = 0; n <= mm - 5; n++)

f[6][0][0][n] = r_12.comp1 * f[5][0][0][n + 1] + five * f[4][0][0][n + 1];

for (n = 0; n <= mm - 5; n++)

f[0][6][0][n] = r_12.comp2 * f[0][5][0][n + 1] + five * f[0][4][0][n + 1];

for (n = 0; n <= mm - 5; n++)

f[0][0][6][n] = r_12.comp3 * f[0][0][5][n + 1] + five * f[0][0][4][n + 1];

for (n = 0; n <= mm - 6; n++)

f[3][2][2][n] = r_12.comp1 * f[2][2][2][n + 1] + two * f[1][2][2][n + 1];

for (n = 0; n <= mm - 6; n++)

f[2][3][2][n] = r_12.comp2 * f[2][2][2][n + 1] + two * f[2][1][2][n + 1];

for (n = 0; n <= mm - 6; n++)

f[2][2][3][n] = r_12.comp3 * f[2][2][2][n + 1] + two * f[2][2][1][n + 1];

for (n = 0; n <= mm - 6; n++)

f[3][3][1][n] = r_12.comp1 * f[2][3][1][n + 1] + two * f[1][3][1][n + 1];

for (n = 0; n <= mm - 6; n++)

f[3][1][3][n] = r_12.comp1 * f[2][1][3][n + 1] + two * f[1][1][3][n + 1];

for (n = 0; n <= mm - 6; n++)

f[1][3][3][n] = r_12.comp2 * f[1][2][3][n + 1] + two * f[1][1][3][n + 1];

for (n = 0; n <= mm - 6; n++)

f[4][3][0][n] = r_12.comp1 * f[3][3][0][n + 1] + three * f[2][3][0][n + 1];

for (n = 0; n <= mm - 6; n++)


f[4][0][3][n] = r_12.comp1 * f[3][0][3][n + 1] + three * f[2][0][3][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][4][3][n] = r_12.comp2 * f[0][3][3][n + 1] + three * f[0][2][3][n + 1];

for (n = 0; n <= mm - 6; n++)

f[3][4][0][n] = r_12.comp2 * f[3][3][0][n + 1] + three * f[3][2][0][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][3][4][n] = r_12.comp3 * f[0][3][3][n + 1] + three * f[0][3][2][n + 1];

for (n = 0; n <= mm - 6; n++)

f[3][0][4][n] = r_12.comp3 * f[3][0][3][n + 1] + three * f[3][0][2][n + 1];

for (n = 0; n <= mm - 6; n++)

f[5][2][0][n] = r_12.comp1 * f[4][2][0][n + 1] + four * f[3][2][0][n + 1];

for (n = 0; n <= mm - 6; n++)

f[5][0][2][n] = r_12.comp1 * f[4][0][2][n + 1] + four * f[3][0][2][n + 1];

for (n = 0; n <= mm - 6; n++)

f[2][5][0][n] = r_12.comp2 * f[2][4][0][n + 1] + four * f[2][3][0][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][5][2][n] = r_12.comp2 * f[0][4][2][n + 1] + four * f[0][3][2][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][2][5][n] = r_12.comp3 * f[0][2][4][n + 1] + four * f[0][2][3][n + 1];

for (n = 0; n <= mm - 6; n++)

f[2][0][5][n] = r_12.comp3 * f[2][0][4][n + 1] + four * f[2][0][3][n + 1];

for (n = 0; n <= mm - 6; n++)

f[6][1][0][n] = r_12.comp1 * f[5][1][0][n + 1] + five * f[4][1][0][n + 1];

for (n = 0; n <= mm - 6; n++)

f[6][0][1][n] = r_12.comp1 * f[5][0][1][n + 1] + five * f[4][0][1][n + 1];

for (n = 0; n <= mm - 6; n++)

f[1][6][0][n] = r_12.comp2 * f[1][5][0][n + 1] + five * f[1][4][0][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][6][1][n] = r_12.comp2 * f[0][5][1][n + 1] + five * f[0][4][1][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][1][6][n] = r_12.comp3 * f[0][1][5][n + 1] + five * f[0][1][4][n + 1];

for (n = 0; n <= mm - 6; n++)

f[1][0][6][n] = r_12.comp3 * f[1][0][5][n + 1] + five * f[1][0][4][n + 1];

for (n = 0; n <= mm - 6; n++)

f[7][0][0][n] = r_12.comp1 * f[6][0][0][n + 1] + six * f[5][0][0][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][7][0][n] = r_12.comp2 * f[0][6][0][n + 1] + six * f[0][5][0][n + 1];

for (n = 0; n <= mm - 6; n++)

f[0][0][7][n] = r_12.comp3 * f[0][0][6][n + 1] + six * f[0][0][5][n + 1];

for (i1 = 0; i1 <= mm; i1++)

for (j1 = 0; j1 <= mm; j1++)

for(k1 = 0; k1 <= mm; k1++)

s_fgtuv[i1 * ((mm+1)*(mm+1)) + j1 * (mm + 1) + k1] = s_fac1[0] * f[i1][j1][k1][0];

p_F_temp = F_temp;

for (i = 0; i < sheli * shelj * shelk * shell; i++) {

*p_F_temp = zero;

p_F_temp++;

}

dime4 = shell_lim ;

dime3 = dime4*shelk_lim ;

dime2 = dime3*shelj_lim ;

dime1 = dime2*sheli_lim ;

p_F_sh = F_sh;

for (i = 0; i < sheli1 * shelj1 * shelk1 * shell1; i++) {

*p_F_sh = zero;

p_F_sh++;

}

for (i = 0; i < 8; i++)

for (j = 0; j < 6; j++)

for (k = 0; k < 3; k++)


tuv[i][j][k] = 0;

for (i = 0; i < sheli_lim; i++)

slim_ij = 0;

if (index_i == index_j) slim_ij = i;

for (j = slim_ij; j < shelj_lim; j++)

tmax = tuv[sheli_lim][i][0] + tuv[shelj_lim][j][0];

umax = tuv[sheli_lim][i][1] + tuv[shelj_lim][j][1];

vmax = tuv[sheli_lim][i][2] + tuv[shelj_lim][j][2];

n1 = tuv[sheli_lim][i][0];

n2 = tuv[sheli_lim][i][1];

n3 = tuv[sheli_lim][i][2];

n4 = tuv[shelj_lim][j][0];

n5 = tuv[shelj_lim][j][1];

n6 = tuv[shelj_lim][j][2];

slim_jk = 0;

for (k = 0; k < shelk_lim; k++)

slim_kl = 0;

if (index_k == index_l) slim_kl = k;

for (l = slim_kl; l < shell_lim; l++)

if (index_i == index_k && index_j == index_l && i * shelj_lim + j > k * shell_lim + l)

continue;

tpmax = tuv[shelk_lim][k][0] + tuv[shell_lim][l][0];

upmax = tuv[shelk_lim][k][1] + tuv[shell_lim][l][1];

vpmax = tuv[shelk_lim][k][2] + tuv[shell_lim][l][2];

n7 = tuv[shelk_lim][k][0];

n8 = tuv[shelk_lim][k][1];

n9 = tuv[shelk_lim][k][2];

n10 = tuv[shell_lim][l][0];

n11 = tuv[shell_lim][l][1];

n12 = tuv[shell_lim][l][2];

int ijmax = (imax + jmax + 1);

int klmax = (kmax + lmax + 1);

counter1 = 0; counter2 = 0;
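// Contract the Hermite integrals with the ket-side Hermite expansion
// coefficients s_c2x; tpupvpsign carries the (-1)^(tp+up+vp) parity factor
// of the McMurchie-Davidson formulation.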

for (t = 0; t <= tmax; t++)

for (u = 0; u <= umax; u++)

for (v = 0; v <= vmax; v++)

counter2 = 0;

for (tp = 0; tp <= tpmax; tp++)

for (up = 0; up <= upmax; up++)

for (vp = 0; vp <= vpmax; vp++)

tpupvpsign = -one;

tpupvp2 = tp + up + vp + 2;

if ((tpupvp2 / 2) * 2 == tpupvp2)

tpupvpsign = one;

c1fac[t][u][v] += s_fgtuv[(t + tp)*((mm+1)*(mm+1)) + (u+up)*(mm+1) + (v+vp)] * s_c2x[tp *(lmax + 1)*(kmax + 1) + n7 * (lmax + 1) + n10]\

* s_c2x[up * (kmax + 1) * (lmax + 1) + n8 * (lmax + 1) + n11]\

* s_c2x[vp * ( lmax + 1) * (kmax + 1) + n9 * (lmax + 1) + n12]\

* tpupvpsign; // end t u v loop

for (t = 0; t <= tmax; t++)

for (u = 0; u <= umax; u++)

for (v = 0; v <= vmax; v++)

p_F_temp = F_temp + dime2 * i + dime3 * j + dime4 * k + l ;

*p_F_temp += c1fac[t][u][v] * s_c1x[t *(imax + 1)*(jmax + 1) + n1 * (jmax + 1) + n4]\

*s_c1x[u * (imax + 1) * (jmax + 1) + n2 * (jmax + 1) + n5]\

*s_c1x[v * (imax + 1) * (jmax + 1) + n3 * (jmax + 1) + n6];

// end ijkl loop

nsheli = *(shells->num_ij + shells->ord_sh[index_i]);

nshelj = *(shells->num_ij + shells->ord_sh[index_j]);

nshelk = *(shells->num_ij + shells->ord_sh[index_k]);


nshell = *(shells->num_ij + shells->ord_sh[index_l]);

p_i1 = shells->ind_i + shells->opp_sh[index_i];

p_i = shells->ind_j + shells->opp_sh[index_i];

p_rot1 = shells->rot + shells->opp_sh[index_i];

for (i = 0; i < nsheli; i++)

p_j1 = shells->ind_i + shells->opp_sh[index_j];

p_j = shells->ind_j + shells->opp_sh[index_j];

p_rot2 = shells->rot + shells->opp_sh[index_j];

for (j = 0; j < nshelj; j++)

p_k1 = shells->ind_i + shells->opp_sh[index_k];

p_k = shells->ind_j + shells->opp_sh[index_k];

p_rot3 = shells->rot + shells->opp_sh[index_k];

for (k = 0; k < nshelk; k++)

p_l1 = shells->ind_i + shells->opp_sh[index_l];

p_l = shells->ind_j + shells->opp_sh[index_l];

p_rot4 = shells->rot + shells->opp_sh[index_l];

for (l = 0; l < nshell; l++)

i1 = *p_i1;

j1 = *p_j1;

k1 = *p_k1;

l1 = *p_l1;

fac2 = one;

if (index_i == index_j && *p_i > *p_j) fac2 = zero;

if (index_k == index_l && *p_k > *p_l) fac2 = zero;

if (index_i == index_k && index_j == index_l && *p_i * shelj1 + *p_j > *p_k * shell1 + *p_l) fac2 = zero;

if (index_i == index_j && *p_i1 > *p_j1) { j1 = *p_i1; i1 = *p_j1; }

if (index_k == index_l && *p_k1 > *p_l1) { l1 = *p_k1; k1 = *p_l1; }

if (index_i == index_k && index_j == index_l && i1 * shelj + j1 > k1 * shell + l1) {

tmp = i1; i1 = k1; k1 = tmp; tmp = j1; j1 = l1; l1 = tmp;

}

p_F_temp = F_temp + i1 * shelj * shelk * shell + j1 * shelk * shell + k1 * shell + l1;

p_F_sh = F_sh + *p_i * shelj1 * shelk1 * shell1 + *p_j * shelk1 * shell1 + *p_k * shell1 + *p_l;

*p_F_sh += fac2 * *p_F_temp * *p_rot1 * *p_rot2 * *p_rot3 * *p_rot4;

p_l++;

p_l1++;

p_rot4++;

p_k++;

p_k1++;

p_rot3++;

p_j++;

p_j1++;

p_rot2++;

p_i++;

p_i1++;

p_rot1++;

p_F_sh = F_sh;

for (i = 0; i < sheli1; i++)

for (j = 0; j < shelj1; j++)

for (k = 0; k < shelk1; k++)

for (l = 0; l < shell1; l++)

fac = one;
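// fac corrects for double counting: when symmetry-equivalent index pairs
// coincide, the same integral would otherwise be accumulated twice.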

if (index_i == index_j && i == j) fac /= two;

if (index_k == index_l && k == l) fac /= two;

if ((index_i == index_k && index_j == index_l) && (i == k && j == l)) fac /= two;

if (fabs(*p_F_sh) > 1e-09)

d_fgtuvfinal[blockIdx.x + blockIdx.y * 6] = *p_F_sh * fac;

count++;

p_F_sh++;


References

[1] Boys, S. F. Electronic Wave Functions. I. A General Method of Calculation for the Stationary States of Any Molecular System. Proceedings of the Royal Society of London, Ser. A 1950, 200, 542.

[2] McMurchie, L. E.; Davidson, E. R. One- and Two-Electron Integrals over Cartesian Gaussian Functions. Journal of Computational Physics 1978, 26, 218.

[3] Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. Journal of Chemical Theory and Computation 2008, 4, 222-231.

[4] Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation. Journal of Chemical Theory and Computation 2009, 5, 1004-1015.

[5] Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics. June 11, 2009.

[6] Kirk, David; Hwu, Wen-mei. Programming Massively Parallel Processors (draft CUDA textbook). 2006-2008.

[7] Moore, Gordon E. Cramming More Components onto Integrated Circuits. Electronics Magazine 1965, p. 4.

[8] Garg, Rahul. A Compiler for Parallel Execution of Numerical Python Programs on Graphics Processing Units. Master's Thesis, Department of Computer Science, University of Alberta, Fall 2009.

[9] NVIDIA. CUDA C Programming Best Practices Guide, CUDA Toolkit 2.3, July 2009.

[10] NVIDIA. CUDA Programming Guide, Version 2.3.1, August 26, 2009.

[11] Almlof, J.; Faegri, K.; Korsell, K. Principles for a Direct SCF Approach to LCAO-MO Ab Initio Calculations. Journal of Computational Chemistry 1982, 3, 385.

[12] Hohenberg, P.; Kohn, W. Inhomogeneous Electron Gas. Physical Review 1964, 136, B864.

[13] Szabo, A.; Ostlund, N. S. Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory. McGraw-Hill, 1989.

[14] Hedin, L. New Method for Calculating the One-Particle Green's Function with Application to the Electron-Gas Problem. Physical Review 1965, 139, A796.

[15] Parr, Robert G.; Yang, Weitao. Density-Functional Theory of Atoms and Molecules. Oxford University Press, 1989.

[16] Demmel, J. W.; Volkov, V. Benchmarking GPUs to Tune Dense Linear Algebra. Conference on High Performance Networking and Computing, November 2008.

[17] Shi, Guochun. Implementation of Scientific Computing Applications on the Cell Broadband Engine Processor. NCSA, University of Illinois at Urbana-Champaign.

[18] Hill, M. D.; Marty, M. R. Amdahl's Law in the Multicore Era. University of Wisconsin and Google, July 2008.

[19] Amarasinghe, Saman. Introduction to Multicore Programming, Lecture 1. MIT, January 2007.

[20] Farber, Rob. CUDA, Supercomputing for the Masses. April 15, 2008.

[21] Pople, J. A.; Segal, G. A. Approximate Self-Consistent Molecular Orbital Theory. III. CNDO Results for AB2 and AB3 Systems. Journal of Chemical Physics 1966, 44, 3289.

[22] Figure from http://en.wikipedia.org/wiki/Amdahl's_law, Creative Commons Licence.
