

New Astronomy 12 (2006) 169–181

Performance tuning of N-body codes on modern microprocessors: I. Direct integration with a Hermite scheme on x86_64 architecture

Keigo Nitadori a,*, Junichiro Makino b, Piet Hut c

a Department of Astronomy, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
b National Astronomical Observatory of Japan, Mitaka, Tokyo 181-8588, Japan

c Institute for Advanced Study, Princeton, NJ 08540, USA

Received 3 November 2005; received in revised form 16 July 2006; accepted 18 July 2006
Available online 8 September 2006
Communicated by G.F. Gilmore

Abstract

The main performance bottleneck of gravitational N-body codes is the force calculation between two particles. We have succeeded in speeding up this pair-wise force calculation by factors between 2 and 10, depending on the code and the processor on which the code is run. These speed-ups were obtained by writing highly fine-tuned code for x86_64 microprocessors. Any existing N-body code, running on these chips, can easily incorporate our assembly code programs.

In the current paper, we present an outline of our overall approach, which we illustrate with one specific example: the use of a Hermite scheme for a direct N^2 type integration on a single 2.0 GHz Athlon 64 processor, for which we obtain an effective performance of 4.05 Gflops, for double-precision accuracy. In subsequent papers, we will discuss other variations, including the combinations with N log N codes, single-precision implementations, and performance on other microprocessors.
© 2006 Elsevier B.V. All rights reserved.

PACS: 98.10.+z

Keywords: Stellar dynamics; Methods: numerical

1. Introduction

Some N-body simulations can be sped up in various ways, by using faster algorithms such as tree codes (Barnes and Hut, 1986) and/or special-purpose hardware such as the GRAPE family (Sugimoto et al., 1990; Makino et al., 2003; Fukushige et al., 2005). For some regimes, such as low N values, these speed-up methods are not very efficient, and it would be nice to find other ways to improve the speed of such calculations. It would be even better if these alternative ways can be combined with other methods of speed-up.


We explore here a general approach based on speeding up the inner loop of gravitational force calculations, namely the interactions between one pair of particles. This approach is also useful when using tree codes, since in that case the calculation cost is still dominated by force calculations. Even for GRAPE applications, this approach will still be useful in many cases, since there are always some calculations which are done more efficiently on the front end.

In particular, we consider the optimization of the inner force loop on the x86-64 (or AMD64 or EM64T) architecture, the newest incarnation of the architecture that originated with the Intel 8080 microprocessor. Processors with an x86-64 instruction set are currently the most widely used. Athlon 64 and Opteron microprocessors from AMD, and many recent models of Pentium 4 and Xeon microprocessors from Intel, support this instruction set.


As will be shown in Section 2, a straightforward implementation of the inner force loop using either Fortran or C, compiled with standard compilers like GCC or ICC for x86-64 processors, results in a performance that is significantly lower than the theoretical peak value one can expect from the hardware. In the following, we discuss how we can improve the performance of the force loop on processors with an x86-64 instruction set.

1.1. The SSE2 instruction set

Our approach is based on the use of new features added to the x86 microprocessors in the last eight years. The vendors' views on the use of these instruction sets are given in Intel (2006) and AMD (2004). The first one is the SSE2 instruction set for double-precision floating-point arithmetic. Traditionally, the instruction set for x86 microprocessors has included the so-called x87 instruction set, which was originally designed for the 8087 math coprocessor of the 8086 16-bit microprocessor. This instruction set is stack based, in the sense that it does not have any explicit way to specify registers. Instead, registers are accessed indirectly as a stack, where the two operands of an arithmetic operation are taken from the top of the stack (popped) and the result is placed back at the top of the stack (pushed). Memory access also takes place through the top of the stack.

This x87 instruction set had the advantage that the instructions are simple and few in number, but for the last 15 years the design of a fast floating-point unit for this x87 instruction set has been a major problem for all x86 microprocessors. If one were really to design stack-based hardware, any pipelining would be practically impossible. In order to allow pipelining, current x86-based microprocessors, from Intel as well as AMD, translate the stack-based x87 instructions into RISC-like, presumably standard three-address register-to-register instructions in hardware at execution time.

This approach has given quite high performance, certainly much higher than what would have been possible with the original stack-based implementation. However, it was clear that pipelining and better use of hardware registers would be much easier if one could use an instruction set with explicit reference to the registers. In 2001, with the introduction of the Pentium 4 microprocessors, Intel added such a new floating-point instruction set, called SSE2. It is still not a real three-address instruction set; rather, it uses a two-address form, where the address of the source register and that of the destination register are the same. Moreover, SSE2 still supports operations between data in main memory and data in a register, as was the case with the IBM System/360. Thus, it still has the look and feel of an instruction set from the 1960s.

1.2. Minimizing memory access

Even though operations between operands in memory and operands in registers are supported, execution would clearly be much faster if all operands could reside in registers. However, with the original SSE2 instruction set it was difficult to eliminate memory accesses for intermediate results, because there were only eight registers available for SSE2 instructions. For whatever reason, these registers are called "XMM" registers in the manufacturers' documents, and we follow this convention. With the new x86_64 instruction set, the number of these "XMM" registers was doubled from 8 to 16. The implication for N-body calculations was that it now became possible to minimize memory access during the inner force loop. A form of optimization using this approach will be discussed in Section 3.

1.3. Exploiting two-word double-precision parallelism

Another important feature of SSE2 (which stands for Streaming SIMD Extensions 2) is that it is defined to operate on a pair of 64-bit floating-point words, instead of a single floating-point word. This effectively means that the use of SSE2 instructions automatically results in the execution of two floating-point operations in parallel. While this feature cannot easily be exploited by compiler-based optimization, it is possible to gain considerable profit from it through judicious hand coding. We discuss the use of this parallel nature of the SSE2 instruction set in Section 4.

1.4. Exploiting four-word single-precision parallelism

SSE2 is not the only new floating-point instruction set that has been made available for the x86 hardware. As the name SSE2 already suggests, there is an earlier SSE instruction set, which is similar to SSE2 but works only on single-precision floating-point numbers. As is the case with SSE2, SSE also works on multiple data in parallel, but instead of the two double-precision words of SSE2, SSE works on four single-precision floating-point numbers simultaneously. Thus, the peak calculation speed of SSE is at least a factor of two higher than that of SSE2. For those force calculations where single precision gives us a sufficient degree of accuracy, we can make use of SSE, gaining a performance that is even higher than what would have been possible with SSE2, as we discuss in Section 5.

1.5. Utilizing built-in inverse square root instructions

SSE was designed mainly to speed up coordinate transformations in three-dimensional graphics. As a result, it has a special instruction for the very fast calculation of an approximate inverse square root, which is intended as a good initial value for Newton-Raphson iteration. This is exactly what we need for the calculation of gravitational forces. We discuss the use of this approximate inverse square root for double-precision calculations in Section 4.


1.6. Related works

The explicit use of the SSE/SSE2 instruction sets is becoming popular in various fields of scientific computing. This trend, as we stated earlier, indicates that there is a mismatch between the instruction set and the ability of compilers to use it.

Most notably, there are many efforts to use the SSE/SSE2 instruction sets for Lattice QCD calculations (Koma, 2005). Also, various libraries for basic mathematical operations, such as FFTW¹ for fast Fourier transforms or ATLAS² for linear algebra, support SSE/SSE2. Some recent versions of packages for molecular dynamics also support SSE/SSE2 (van der Spoel et al., 2005).

Most of these works rely heavily on the use of assembly code, either in the form of routines written entirely in assembly language or as inline assembly. In this paper, we describe the effective use of SSE/SSE2 with minimal use of assembly code. Also, the use of these SSE/SSE2 instructions is, in our case, essentially limited to a single function which calculates the gravitational interaction between particles. Thus, it is relatively easy to adapt our approach to any N-body code. Another difference is that the maximum speed-up we achieved (close to a factor of four on the Intel Xeon platform) seems to be larger than what has been reported, which is generally around a factor of two.

These differences come partly from the difference in the nature of the problems, and partly from our approach.

For gravitational N-body simulations, the GRAPE family of special-purpose computers (Sugimoto et al., 1990) does offer a price-performance ratio significantly better than that of general-purpose microprocessors, even with the help of the new SIMD instruction sets we describe in this paper. Even so, for small-N calculations, or for modestly large-N calculations on large parallel cluster computers, the efficiency of GRAPE hardware is relatively limited. The same is true for tree codes, since the overall speed is limited by the speed of communication between the host and the GRAPE hardware.

Hamada et al. (2005) reported the use of FPGAs (field-programmable gate arrays) for N-body simulations. In terms of programmability, FPGA devices fall in between custom hardware like GRAPE and general-purpose microprocessors. In terms of performance, they again fall in between. However, one potential advantage of FPGAs is that one can optimize the accuracy of individual operations to reduce the size of the hardware. Thus, if very low accuracy is acceptable, FPGAs can be significantly better than either of the other two approaches.

¹ http://www.fftw.org
² http://www.netlib.org/atlas/

1.7. Organization

This paper is organized as follows. In Section 2, we give the standard C-language implementation of the force loop. We consider the code fragment which calculates both the acceleration and its first time derivative, as used with the Hermite integration scheme (Makino and Aarseth, 1992). We present the assembly-language output and the measured performance, and describe possible room for improvement. We call this implementation the baseline implementation.

In Section 3, we discuss an optimized C-language implementation of the force loop for the Hermite scheme. The difference from the baseline implementation is that the C-language code is hand-tuned so that the loads and stores of intermediate results are eliminated from the generated assembly-language output. This version gives us a speed-up of 46% compared to the baseline method.

In Section 4, we discuss a more efficient use of the SSE2 instructions, where the forces on two particles are calculated in parallel using the SIMD nature of the SSE2 instruction set. Here, we use the intrinsic data types defined in GCC, which allow us to use this SIMD nature within the syntax of the C language. Also, we use the fast approximate inverse square root instruction together with a Newton-Raphson iteration. This implementation is 88% faster than the baseline.

In Section 5, we discuss a mixed SSE-SSE2 implementation of the force calculation for the Hermite scheme. In many applications, full double-precision accuracy is not necessary, except for the initial subtraction between positions and the final accumulation of the forces. "High-accuracy" GRAPE hardware (GRAPE-2, 4 and 6) relies on this mixed-accuracy calculation. Thus, it is possible to perform most of the force-calculation operations using SSE single-precision instructions, thereby further speeding up the force calculation. In this way, we can speed up the calculation by another 67% over the SSE2 parallel implementation of Section 4, achieving a 219% speed-up (3.19 times faster) over the baseline implementation.

2. Baseline implementation

2.1. Target functions

The target functions that we want to calculate are

  a_i = \sum_j \frac{m_j \mathbf{r}_{ij}}{(r_{ij}^2 + \varepsilon^2)^{3/2}},    (1)

  j_i = \dot{a}_i = \sum_j m_j \left[ \frac{\mathbf{v}_{ij}}{(r_{ij}^2 + \varepsilon^2)^{3/2}} - \frac{3 (\mathbf{v}_{ij} \cdot \mathbf{r}_{ij}) \, \mathbf{r}_{ij}}{(r_{ij}^2 + \varepsilon^2)^{5/2}} \right],    (2)

  \phi_i = - \sum_j \frac{m_j}{(r_{ij}^2 + \varepsilon^2)^{1/2}},    (3)

where a_i and φ_i are the gravitational acceleration and the potential of particle i, the jerk j_i is the time derivative of the acceleration, and r_i, v_i, and m_i are the position, velocity and mass of particle i, with r_ij = r_j − r_i and v_ij = v_j − v_i.

The calculation of a and φ requires 9 multiplications, 10 addition/subtraction operations, 1 division and 1 square root calculation. Clearly, the calculation of an inverse square root is more expensive than addition/multiplication operations. Therefore, if we want to measure the speed of a force-calculation loop in terms of the number of floating-point operations per second, we need to introduce some conversion factor for division and square root calculations. In this paper, we use 38 as the total number of floating-point operations for the calculation of a and φ. This implies that we effectively assign around 10 operations to each division and to each square root calculation. This particular convention was introduced by Warren et al. (1997), and we follow it here since it seems to be a reasonable representation of the actual computational costs of force calculations on typical scalar processors.

In addition, it requires 11 multiplications and 11 addition/subtraction operations to calculate j. Thus, the total number of floating-point operations per inner force loop for the Hermite scheme can be given as 60. This is the number that we will use here. Note that we have previously used numbers that were slightly smaller, by about 5%, using 57 instead of 60 (Makino and Fukushige, 2001).

In the case of a simple leapfrog method, or a linear multistep method based on a divided difference table such as NBODY5 (Aarseth, 1963, 2003), we need to calculate only a_i and φ_i. For a Hermite scheme (Makino and Aarseth, 1992) we need to determine j_i as well.

2.2. Baseline implementation and its performance

The following code fragments contain what we regard as the baseline implementation of the force calculations, where we use the word 'force' loosely, to indicate the calculations of the accelerations and jerks as well as the potential. In other words, we consider 'force calculations' to comprise all the basic low-level dynamical calculations governing the interactions between pairs of particles.

List 1. Baseline implementation; calculates the force on the ith particle.
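For concreteness, a minimal sketch of such a baseline loop is given below. The function signature and variable names are illustrative, not necessarily the exact code of List 1:

    #include <math.h>

    /* Baseline: acceleration, jerk and potential on particle i,
       accumulated in plain C arrays (Eqs. (1)-(3)). */
    void calc_force(int i, int n, double eps2,
                    double pos[][3], double vel[][3], const double *mj,
                    double acc[3], double jerk[3], double *pot)
    {
        int j, k;
        acc[0] = acc[1] = acc[2] = jerk[0] = jerk[1] = jerk[2] = *pot = 0.0;
        for (j = 0; j < n; j++) {
            double dx[3], dv[3], r2 = eps2, rv = 0.0;
            if (j == i) continue;          /* branch to avoid self interaction */
            for (k = 0; k < 3; k++) {
                dx[k] = pos[j][k] - pos[i][k];
                dv[k] = vel[j][k] - vel[i][k];
                r2 += dx[k] * dx[k];
                rv += dx[k] * dv[k];
            }
            double rinv   = 1.0 / sqrt(r2);   /* one division, one square root */
            double rinv2  = rinv * rinv;
            double mrinv3 = mj[j] * rinv * rinv2;
            rv *= 3.0 * rinv2;
            *pot -= mj[j] * rinv;
            for (k = 0; k < 3; k++) {
                acc[k]  += mrinv3 * dx[k];                 /* Eq. (1) */
                jerk[k] += mrinv3 * (dv[k] - rv * dx[k]);  /* Eq. (2) */
            }
        }
    }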

Line 29 in List 1 is a branch to avoid self interaction. We might remove this branch when we use softening, but in this case the branch prediction of the microprocessor works ideally, hence the performance changes little if we remove it.

Table 1
Performance of the code shown in List 1 when N = 1024, in cycles per interaction and Gflops, on Athlon 64 2.0 GHz

Compiler      Options                                       Cycles   Gflops
GCC 3.3.1     -O3 -ffast-math -funroll-loops                 94.8    1.27
GCC 3.3.1     -O3 -ffast-math -funroll-loops -mfpmath=387   119      1.01
GCC 4.0.1     -O3 -ffast-math -funroll-loops                100      1.20
PGI 5.1       -fastsse                                       97.6    1.23
EKO Path 2.0  -O3 -ffast-math                                95.0    1.26
ICC 9.0       -O3                                            95.8    1.25

Table 1 shows the performance of this code on an AMD Athlon 64 3000+ (2.0 GHz) processor with several different compilers. The first column gives the compiler used, the second the compiler options, the third the clock cycles per pairwise force calculation, and the fourth the speed in Gflops. We used the -O3 flag with several other more specific flags, since giving a higher optimization flag (GCC accepts up to -O6) did not result in any more gain. All compilers generate SSE2 instructions instead of x87 instructions unless we explicitly set options to use x87. The performance is fairly good, but not ideal. In the following we investigate how we can improve the performance.

List 2 is the assembly-language output of the Hermite force loop using the GCC compiler. We show only the instructions related to the accumulation of acc and jerk:

List 2. Part of the assembly output of List 1, using GCC 3.3.1 with options -S -O3 -ffast-math -funroll-loops, commented by the authors for clarity.
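The full listing is available in the supplementary data (Appendix A); schematically, the pattern criticized below looks like this (AT&T syntax; the stack offsets and register numbers here are invented for illustration):

    movsd   -40(%rsp), %xmm1    # reload acc[0] from the stack
    addsd   %xmm0, %xmm1        # acc[0] += partial force (in %xmm0)
    movsd   %xmm1, -40(%rsp)    # store acc[0] back to the stack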

One can see that there are quite a few unnecessary load/store instructions. The arrays acc[3] and jerk[3] could be placed in registers, like the variable pot, but they are placed in memory instead. If we look at the entire output (not shown in the list), there are more unnecessary instructions.

Fig. 2. Clock cycles per force-calculation loop as a function of N on Athlon 64 3000+ (2.0 GHz). Open squares and triangles show the results of using code with prefetch, with and without 64-byte alignment; filled squares and triangles show the results of using code without prefetch, with and without 64-byte alignment.

3. C-level optimization

3.1. Code modification

As we have seen in the previous section, the assembly-language output shows a significant number of unnecessary load/store instructions. In principle, if the compilers were clever enough, they would be able to eliminate these unnecessary operations. In practice, we need to guide the compilers so that they do not generate unnecessary code.

We have achieved significant speed-up using the following two guiding principles:

1. Eliminate assignments to any array element in the force loop.
2. Reuse variables as much as possible, in order to minimize the number of registers used.

Apparently, present-day compilers are not clever enough to eliminate load/store operations if elements of arrays are used as left-hand-side values. Therefore, we hand-unroll all loops of length three and use scalar variables instead of arrays for three-dimensional vectors.

In well-written programs, variables which contain the values of different physical quantities should have different names. However, the use of many different names prevents optimization, since it results in a number of variables too large to be fitted into the register set of the SSE2 instructions (there are 16 "XMM" registers for SSE2 instructions). Therefore, we explicitly reuse variables such as x, y, z, so that these can hold both the position-difference components and the pair-wise force components, at different times in the computation.

The resulting C-language code is as follows:

List 3. An optimized force loop.
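A sketch of such a hand-unrolled loop follows; the structure layout and names are ours, for illustration, and the prefetch call and the pad field are explained in Section 3.2:

    #include <math.h>

    struct predictor {
        double x, y, z, m;      /* position and mass              */
        double vx, vy, vz;      /* velocity                       */
        double pad;             /* pad the structure to 64 bytes  */
    };

    void calc_force(int n, double eps2, const struct predictor *pr,
                    double xi, double yi, double zi,
                    double vxi, double vyi, double vzi,
                    double a[3], double jrk[3], double *pot)
    {
        double ax = 0.0, ay = 0.0, az = 0.0;
        double jx = 0.0, jy = 0.0, jz = 0.0, p = 0.0;
        int k;
        for (k = 0; k < n; k++) {
            __builtin_prefetch(&pr[k + 2], 0, 3);   /* cf. Section 3.2 */
            /* hand-unrolled; x, y, z are reused for the force below */
            double x  = pr[k].x  - xi,  y  = pr[k].y  - yi,  z  = pr[k].z  - zi;
            double vx = pr[k].vx - vxi, vy = pr[k].vy - vyi, vz = pr[k].vz - vzi;
            double r2 = x*x + y*y + z*z + eps2;
            double rinv   = 1.0 / sqrt(r2);
            double rinv2  = rinv * rinv;
            double mrinv3 = pr[k].m * rinv * rinv2;
            double rv = 3.0 * (x*vx + y*vy + z*vz) * rinv2;
            p -= pr[k].m * rinv;
            x *= mrinv3; y *= mrinv3; z *= mrinv3;  /* now the pairwise force */
            ax += x; ay += y; az += z;
            jx += mrinv3*vx - rv*x;
            jy += mrinv3*vy - rv*y;
            jz += mrinv3*vz - rv*z;
        }
        a[0] = ax; a[1] = ay; a[2] = az;
        jrk[0] = jx; jrk[1] = jy; jrk[2] = jz;
        *pot = p;
    }

(The self-interaction branch of List 1 is omitted here for brevity.)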

With this code, the accumulation of acc and jerk is now compiled by GCC 3.3.1 (flags -O3 -ffast-math) into the following code:

List 4. Part of the assembly output of List 3.
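Schematically, the accumulation now stays entirely in registers, for example (AT&T syntax; register allocation invented for illustration):

    mulsd   %xmm6, %xmm0        # xmm0 = mrinv3 * x
    addsd   %xmm0, %xmm8        # ax += mrinv3 * x; ax never leaves %xmm8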

We can see that there are no load/store instructions. In fact, we eliminated unnecessary load/store instructions completely from the entire force loop. Fig. 1 shows the use of the registers, and Table 2 presents the performance data for this code. This version is 46% faster than the code in Section 2. Very roughly, this speed-up is consistent with the reduction in the number of assembly-language instructions (from 82 to 62), but is somewhat larger. This is partly because instructions which take memory arguments have a larger latency than register-only instructions.

Fig. 1. The use of XMM registers in List 4.

Table 2
Performance of List 3 when N = 1024, in cycles per interaction and Gflops, on Athlon 64 3000+ 2.0 GHz

Compiler    Options            Cycles   Gflops
GCC 3.3.1   -O3 -ffast-math    64.8     1.85
GCC 4.0.1   -O3 -ffast-math    64.7     1.85
ICC 9.0     -O3                79.4     1.51

3.2. Prefetch insertion and alignment

Line 22 of List 3 shows a built-in function of GCC which is compiled into a prefetch instruction. In this case, the prefetch instruction loads the data which will be needed two iterations after the current one. The second parameter is a read/write flag, for which zero indicates preparing for a read, and the third parameter is the degree of temporal locality, which takes a value from 0 to 3.

Line 6 of List 3 shows a pad that makes the size of the predictor structure exactly 64 bytes, which is the cache-line size of the Athlon 64 processor. To make the predictor structure aligned on a 64-byte boundary, we use memalign() or posix_memalign() instead of malloc().

Fig. 2 shows the performance of the force loop as a function of the loop length N, with and without this prefetch instruction and 64-byte alignment. The fastest case depends on N. For the region N ≤ 1024, in which the data fit into the 64 kB L1 data cache, the code with prefetch and 64-byte alignment is the best. For larger N, the code with prefetch and without alignment is the best.

4. Assembly-level optimization

In this section, we describe two extensions of the force loop specialized for the x86/x86_64 architecture. The first is using the SSE2 vector (SIMD) mode instead of the SSE2 scalar (SISD) mode. The second is replacing one division and one square root by a special SSE instruction for a fast approximate inverse square root, followed by a Newton-Raphson iteration.


4.1. SSE2 vector mode

In the previous section we improved the performance of the C-language implementation of the force loop, essentially by hand-optimizing the C code so that the generated assembly code becomes optimal. However, in this way we did not use the full capability of SSE2. While SSE2 instructions can process two double-precision numbers in parallel, the force loop discussed in the previous section uses only one of these two words. Clearly, we did not yet use the "SIMD" nature of the instruction set.

Whether or not we can gain by using this SIMD nature depends on the particular code, and also on the particular processor. On the Intel P4 architecture the SIMD mode can offer up to a factor of two speed-up, while on the AMD Athlon 64 or Intel Pentium M, the speed increase can be small (or even negative).

There are many ways to use this SIMD feature. Since there are similarities with the vector instructions of some old vector processors, in particular the Cyber 205, one could make use of an automatic vectorizing compiler (such as ICC version 6.0 and later). However, just as was the case with the old vectorizing compilers, the vectorizing capabilities of modern compilers are still very limited, and it is hard to rewrite the force loop so that the compiler can make use of the SIMD capability.

In fact, part of the reason why vectorization is difficult is that the load/store capability of SSE2 instructions is rather limited: it can work efficiently only on a pair of two double-precision words at consecutive, 16-byte-aligned addresses. This means that the basic loop structure cannot work as-is, and one needs to copy the data of two particles into some special data structure, in order to let the compiler generate the appropriate SIMD instructions.

Here, we have adopted a low-level approach, in which we make use of the special data type for pairs of double-precision words defined in GCC. Basically, this data type, which we call v2df, corresponds to what resides in one XMM register (a pair of double-precision words). We can perform the usual arithmetic operations, and even function calls, on this data type. Here is the code which defines this v2df data type:

List 5. Defining SSE/SSE2 data type.
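A minimal sketch of these definitions, using GCC's vector_size attribute (the exact spelling of the attribute has varied across GCC versions):

    /* two doubles or four floats packed into one 128-bit XMM register */
    typedef double v2df __attribute__((vector_size(16)));
    typedef float  v4sf __attribute__((vector_size(16)));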

The v4sf data type packs four single-precision words for SSE, which we will use later.

The basic idea here is to calculate the forces from one particle on two different particles in parallel. One could instead calculate the forces from, rather than on, two different particles in parallel, but that would result in a more complicated program, since we would then need to add up the two partial forces at the end. Also, from the point of view of memory bandwidth, our approach is more efficient, since we need to load only one particle per iteration. Note that this is the same approach as what is called i-parallelism in the various versions of the GRAPE hardware, where parallel pipelines calculate the forces on different particles from the same set of particles.

In this code, we use macros for a gathering load and a scattering store, built from instructions that load/store the higher/lower half of an XMM register:

List 6. Gathering load and scattering store for SSE2.
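A sketch of such macros; the __builtin_ia32_* names follow the undocumented naming convention discussed below and may differ between GCC versions:

    /* gather two doubles into one register: reg = {*lo, *hi} */
    #define LOAD2(reg, lo, hi) do {                   \
        (reg) = __builtin_ia32_loadlpd((reg), (lo));  \
        (reg) = __builtin_ia32_loadhpd((reg), (hi));  \
    } while (0)

    /* scatter the two words of reg back to memory */
    #define STORE2(reg, lo, hi) do {                  \
        __builtin_ia32_storelpd((lo), (reg));         \
        __builtin_ia32_storehpd((hi), (reg));         \
    } while (0)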

We should be careful about the fact that these built-in functions are an undocumented feature of GCC. The GCC documentation (FSF, 2005) describes built-in functions for most SSE/SSE3 instructions in its Sections 5.3 and 5.4, but not for any SSE2 instruction. We can nevertheless call most SSE2 instructions through the prefix __builtin_ia32_.³ Note that, for whatever reason, instructions such as movxxx from/to memory are renamed loadxxx/storexxx. This does not mean that GCC always generates correct assembly output. For example, GCC generates wrong code for the first macro in List 6 when reg is not assigned to a register variable. This can be considered either a bug or a feature, and it shows the risk of using these undocumented aspects of GCC.

³ We "found" this feature by trial and error.

We cannot use numerical literals like "3.0" directly in vector operations, though GCC supports array-like initialization and casting (these are also undocumented features).

List 7. Initialization and casting for numerical values.
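For example (values illustrative):

    v2df two   = {2.0, 2.0};                       /* array-like initialization */
    v4sf three = (v4sf){3.0f, 3.0f, 3.0f, 3.0f};   /* compound-literal cast     */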

Note that we need to copy the data of a single particle into both the high and low words of an XMM register. The new SSE3 extension supports this "broadcasting", while the original SSE2 set did not. Therefore, we use the following macro:

List 8. Broadcast loading for double precision.
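A sketch of such a macro, using a scalar load followed by unpcklpd (builtin names as in List 6, and equally version-dependent):

    /* load *p into both words of a v2df; without SSE3's movddup this
       takes a scalar load followed by an unpack */
    #define LOADDUP(reg, p) do {                        \
        (reg) = __builtin_ia32_loadlpd((reg), (p));     \
        (reg) = __builtin_ia32_unpcklpd((reg), (reg));  \
    } while (0)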

We show the vectorized force loop using these datatypes and macros in List 9.


List 9. Vectorized force loop using SSE2 vector mode.
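A condensed sketch of such a loop follows. It computes the forces on two i particles in parallel, with each v2df variable holding a pair {q_i0, q_i1}; the function signature, structure layout, and the use of the sqrtpd builtin (the "normal" square-root path; the fast path of Section 4.2 would replace the division and square root) are our illustrative choices:

    typedef double v2df __attribute__((vector_size(16)));   /* from List 5 */

    struct predictor {        /* 64 bytes; the mass is stored twice */
        double x[3];          /* position of particle j             */
        double v[3];          /* velocity of particle j             */
        v2df   m;             /* {m_j, m_j}                         */
    };

    void calc_force2(int n, v2df eps2, const struct predictor *pr,
                     v2df xi, v2df yi, v2df zi,
                     v2df vxi, v2df vyi, v2df vzi,
                     v2df *ax_o, v2df *ay_o, v2df *az_o,
                     v2df *jx_o, v2df *jy_o, v2df *jz_o, v2df *pot_o)
    {
        v2df zero = {0.0, 0.0}, one = {1.0, 1.0}, three = {3.0, 3.0};
        v2df ax = zero, ay = zero, az = zero;
        v2df jx = zero, jy = zero, jz = zero, pot = zero;
        int k;
        for (k = 0; k < n; k++) {
            /* broadcast one j particle, subtract both i particles at once */
            v2df dx  = (v2df){pr[k].x[0], pr[k].x[0]} - xi;
            v2df dy  = (v2df){pr[k].x[1], pr[k].x[1]} - yi;
            v2df dz  = (v2df){pr[k].x[2], pr[k].x[2]} - zi;
            v2df dvx = (v2df){pr[k].v[0], pr[k].v[0]} - vxi;
            v2df dvy = (v2df){pr[k].v[1], pr[k].v[1]} - vyi;
            v2df dvz = (v2df){pr[k].v[2], pr[k].v[2]} - vzi;
            v2df r2     = dx*dx + dy*dy + dz*dz + eps2;
            v2df rinv   = one / __builtin_ia32_sqrtpd(r2);
            v2df rinv2  = rinv * rinv;
            v2df mrinv3 = pr[k].m * rinv * rinv2;
            v2df rv     = three * (dx*dvx + dy*dvy + dz*dvz) * rinv2;
            pot -= pr[k].m * rinv;
            dx *= mrinv3; dy *= mrinv3; dz *= mrinv3;
            ax += dx; ay += dy; az += dz;
            jx += mrinv3*dvx - rv*dx;
            jy += mrinv3*dvy - rv*dy;
            jz += mrinv3*dvz - rv*dz;
        }
        *ax_o = ax; *ay_o = ay; *az_o = az;
        *jx_o = jx; *jy_o = jy; *jz_o = jz; *pot_o = pot;
    }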

The predictor structure has been changed to store the same mass value in two places (line 4), in order to save an instruction (line 53).

4.2. Fast approximate inverse square root

The most expensive part of the force calculation, on recent microprocessors, is the computation of an inverse square root, which requires one division and one square root calculation. Several attempts to speed this up using table lookup, polynomial approximation and Newton-Raphson iteration have been reported (Karp, 1993; Warren et al., 1997). The main difficulty in these approaches is how to quickly obtain a good approximation for the starting value of a Newton-Raphson iteration.

Here, we use the RSQRTSS/PS instructions in the SSE instruction set, which provide approximate values of the inverse square root for scalar/vector single-precision floating-point numbers, to an accuracy of about 12 bits. With one Newton-Raphson iteration, we can obtain 24-bit accuracy, which is sufficient for most "high-accuracy" calculations. If higher accuracy is really necessary, we could apply a second iteration. The Newton-Raphson iteration formula is expressed as:

  x_1 = -\frac{1}{2} x_0 \left( a x_0^2 - 3 \right).    (4)

Here, x_0 is an initial guess for 1/\sqrt{a}.

We show the implementation for a scalar version using inline assembly code with GCC extensions:

List 10. Calling rsqrtss through inline assembly code.
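A sketch of such a routine; note that, as explained below, the factor -1/2 of Eq. (4) is deliberately left out:

    /* approximate 1/sqrt(a) in double precision via RSQRTSS; returns
       x0*(a*x0*x0 - 3), i.e. Eq. (4) without the factor -1/2 */
    static inline double rsqrt_nr(double a)
    {
        float xs = (float)a;       /* convert to single precision */
        float y;
        __asm__("rsqrtss %1, %0" : "=x"(y) : "x"(xs));  /* ~12-bit estimate */
        double x0 = (double)y;     /* back to double precision    */
        return x0 * (a * x0 * x0 - 3.0);  /* one Newton-Raphson step */
    }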

and for a vector version using built-in functions:

List 11. Calling rsqrtps using built-in functions of GCC.
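A corresponding sketch for the vector version; the v4sf arithmetic and the (v4sf){...} literal rely on the GCC extensions described above:

    typedef float v4sf __attribute__((vector_size(16)));  /* from List 5 */

    /* four approximate inverse square roots at once,
       again leaving out the factor -1/2 of Eq. (4) */
    static inline v4sf rsqrt_nr_ps(v4sf a)
    {
        v4sf x0 = __builtin_ia32_rsqrtps(a);   /* ~12-bit estimates */
        return x0 * (a * x0 * x0 - (v4sf){3.0f, 3.0f, 3.0f, 3.0f});
    }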

Note that we skip here the multiplication by -1/2 in Eq. (4). This can be done after the total force is obtained.

To use the RSQRTSS/PS instructions, we first convert r^2 into single precision, then apply these instructions, and finally convert the result back to double precision.

Note that the actual value returned by these RSQRTSS/PS instructions is implementation dependent. In particular, the AMD Athlon 64 processors and the Intel Pentium 4 processors return different values. Fig. 3 shows the errors of the return values as a function of the input values. The AMD implementation has smaller average errors, but at the same time it shows a relatively large systematic bias. Even after one Newton-Raphson iteration in double precision, the results from both implementations show relatively large biases. Table 3 presents the root-mean-square error, maximum error, and bias (mean error) of the approximate value of 1/\sqrt{x}, on an Intel Pentium 4 and an AMD Athlon 64, before and after one Newton-Raphson iteration. Here, the error is measured as (1/2)(x [rsqrt(x)]^2 - 1), and the weight is 1/frac(x), where frac(x) is the normalized fraction of the floating-point number x. We can correct these biases, at least statistically, by multiplying the resulting force by constants which depend on the processor type used.

Fig. 3. Relative error of the return value of the RSQRTSS/PS instructions on AMD Athlon 64 and Intel Pentium 4, for 1 ≤ x ≤ 16 (upper) and 1.5 ≤ x ≤ 1.55 (lower). The error is periodic in powers of 4.

Table 3
Errors of the approximate inverse square root

                     RMSE           MAX error      Bias
AMD, before N-R      7.96 × 10^-5   2.59 × 10^-4    2.21 × 10^-5
AMD, after N-R       1.57 × 10^-8   1.00 × 10^-7   -9.51 × 10^-9
Intel, before N-R    1.16 × 10^-4   3.26 × 10^-4   -8.37 × 10^-8
Intel, after N-R     3.05 × 10^-8   1.60 × 10^-7   -2.01 × 10^-8
IEEE 754 single      3.58 × 10^-8   8.94 × 10^-8    1.52 × 10^-11

4.3. Performance

Table 4 shows the performance of four different types of force loops, using the SSE2 scalar/vector modes, with and without the fast inverse square root.

Table 4
Performance of SSE2 scalar mode and vector mode on Athlon 64 2.0 GHz

SSE2 mode                       Scalar            Vector
sqrt operation                  Normal   Fast     Normal   Fast
Cycles per interaction          64.8     70.5     69.0     50.0
Calculation speed (Gflops)      1.85     1.70     1.73     2.40

5. Mixed-precision force loop

As we have summarized in the introduction, SSE2 is theSIMD instruction set for pairs of double-precision words.


There are also SSE instructions that work on quadruples of single-precision words. Thus, using the SSE instruction set, we can in principle double the performance. If we perform the initial subtraction between positions and the final accumulation of the acceleration and potential in double precision, we can use single-precision SSE instructions for all other calculations, including the subtraction between velocities and the accumulation of the jerk, and still maintain a pretty high accuracy. The main complexity here is the question of how to make use of the four elements of the SSE data type in parallel. The simplest approach is to calculate the forces on four particles, but in that case we need too many variables, which do not all fit into the registers; we would need two XMM registers for each element of the force. Instead, we have tried to achieve the maximum speed by calculating the forces from two particles on two particles (making a total of four pairwise force calculations) in parallel in SSE.

In the following, we present a detailed description of the implementation of this mixed-precision force loop using SSE/SSE2. We use the term i particles for particles which feel the gravitational force, and j particles for particles which exert the force.

5.1. The data structure for j particles

The data structure for the j particles should contain two particles. The code in List 12 achieves this goal by using the data types v2df and v4sf. Note that for the velocities we use the single-precision v4sf data type, since we do not need double-precision accuracy for the velocity. We store the data of the two particles (with indices j and j + 1) as (j, j + 1) for the positions, and (j, j, j + 1, j + 1) for the velocities and masses.

List 12. The j particle structure.
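A sketch of such a structure (typedefs from List 5; field names illustrative):

    typedef double v2df __attribute__((vector_size(16)));
    typedef float  v4sf __attribute__((vector_size(16)));

    typedef struct jpdata {
        v2df x, y, z;      /* positions of j and j+1, double precision */
        v4sf vx, vy, vz;   /* velocities, stored as (j, j, j+1, j+1)   */
        v4sf m;            /* masses,     stored as (j, j, j+1, j+1)   */
        v4sf pad;          /* pad the structure to 128 bytes           */
    } Jpdata;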

The variable pad is used to make the size of the structure an exact multiple of 64 bytes.

5.2. The data structure for i particles

We use the following local variables to store two i par-ticles (with indices i0 and i1).

List 13. Two i particles packed into local variables.
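Illustratively, with hypothetical double-precision arrays pos[][3] and vel[][3] holding the i-particle data:

    /* positions duplicated per register: {x_i0, x_i0} and {x_i1, x_i1} */
    v2df xi0 = {pos[i0][0], pos[i0][0]}, yi0 = {pos[i0][1], pos[i0][1]},
         zi0 = {pos[i0][2], pos[i0][2]};
    v2df xi1 = {pos[i1][0], pos[i1][0]}, yi1 = {pos[i1][1], pos[i1][1]},
         zi1 = {pos[i1][2], pos[i1][2]};
    /* velocities interleaved as (i0, i1, i0, i1) */
    v4sf vxi = {vel[i0][0], vel[i1][0], vel[i0][0], vel[i1][0]};
    v4sf vyi = {vel[i0][1], vel[i1][1], vel[i0][1], vel[i1][1]};
    v4sf vzi = {vel[i0][2], vel[i1][2], vel[i0][2], vel[i1][2]};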

Note that xi0 keeps the position of particle i0 in a duplicated way (the two 64-bit words of the v2df data type store the same data). Similarly, xi1 keeps the data for i1. In this way, by subtracting x of the j-particle variable from xi0 or xi1, we can calculate the displacements of two j particles from one i particle. The results are then converted to single-precision format using a cvtpd2ps instruction (see List 14). After unpcklps (an instruction that interleaves the words of two registers, despite its name) is issued, the contents of x become {xj − xi0, xj − xi1, xj+1 − xi0, xj+1 − xi1}.

For the velocity, we store the data in the order (i0, i1, i0, i1), so that a single subtraction operation provides the four displacement vectors of two j particles from two i particles.

List 14. Subtraction between positions.
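A sketch of this step, with jp pointing at the j-particle structure of List 12 (builtin names as before, and version-dependent):

    /* jp->x holds {x_j, x_j+1}; xi0/xi1 hold the duplicated i positions */
    v2df dx0 = jp->x - xi0;                   /* {x_j - x_i0, x_j+1 - x_i0} */
    v2df dx1 = jp->x - xi1;                   /* {x_j - x_i1, x_j+1 - x_i1} */
    v4sf x  = __builtin_ia32_cvtpd2ps(dx0);   /* to single precision        */
    v4sf x1 = __builtin_ia32_cvtpd2ps(dx1);
    x = __builtin_ia32_unpcklps(x, x1);
    /* x now holds {xj - xi0, xj - xi1, xj+1 - xi0, xj+1 - xi1} */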

5.3. Inverse square root and Newton–Raphson iteration

We use the same Newton-Raphson iteration as in the previous section. Since we do not need the data conversion between the single- and double-precision formats, the code here actually becomes simpler. List 15 gives this part of the code.

List 15. Calculation of the inverse square root in the v4sf data type.

5.4. Accumulating acceleration and potential

Before accumulating the acceleration and potential, we now have to convert the single-precision data back to double precision. Before doing so, we need to split one piece of 128-bit data holding four single-precision words into two (effectively 64-bit) halves with two single-precision words each. This is done by the movhlps instruction. Then we convert the result to double precision using a cvtps2pd instruction. The code appears in List 16.

List 16. Accumulating the x element of the acceleration.
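A sketch of this conversion for the x component; accx is assumed to be the v2df double-precision accumulator for the pair (i0, i1):

    /* ax holds the four single-precision partial forces
       {f(j,i0), f(j,i1), f(j+1,i0), f(j+1,i1)} */
    v4sf hi = __builtin_ia32_movhlps(ax, ax);   /* move high pair down     */
    v2df d0 = __builtin_ia32_cvtps2pd(ax);      /* {f(j,i0),   f(j,i1)}    */
    v2df d1 = __builtin_ia32_cvtps2pd(hi);      /* {f(j+1,i0), f(j+1,i1)}  */
    accx += d0 + d1;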


For the jerk, we accumulate directly in quadruples of single-precision words, since we do not need double-precision accuracy for the jerk. After the total force is obtained, we add up the higher two words and the lower two words of the accumulated data.

5.5. The whole code

List 17 shows the entire code. We provide a simple library which one can call to use this function in a way similar to the way the GRAPE hardware is used. Note that the actual code in List 17 is slightly different from the code in Lists 12-16. We use arrays instead of vector types in the j-particle structure, and the codes in Lists 14 and 16 are abstracted into macros for readability and to save registers.

List 17. Total code of mixed precision force calculation.

5.6. Performance

Table 5 gives the performance of the code listed above. The second column gives the performance of the code after hand tuning of the assembly output.

Table 5
Performance of List 17 in cycles per interaction and Gflops, on Athlon 64 3000+ 2.0 GHz, when N = 1024

           GCC     Hand tuning
Cycles     30.6    29.6
Gflops     3.92    4.05

The first column is the performance of the output of GCC 3.3.1 with option -O3; the second is that of the assembly code hand-tuned after GCC.

Table 6 gives the CPU time for an actual N-body code. For this run, the N-body model is a Plummer model with 1024 particles, in N-body units. The gravitational force is softened with softening parameter ε = 1/256. The timestep criterion is Aarseth's standard criterion with η = 0.01. We integrated the system for two time units and measured the CPU time for the integration from t = 1 to t = 2, to avoid the effect of the startup step. We measured the speed of all codes discussed in this paper on two different platforms. One is the AMD Athlon 64 3000+ processor with a clock speed of 2.0 GHz, for which we used GCC 3.3.1 with options -O3 -ffast-math -funroll-loops. The other is an Intel Xeon EM64T processor with a clock speed of 3.2 GHz, for which we used GCC 3.4.2 with options -O3 -ffast-math -funroll-loops -march=nocona.

Table 6
Summary of the performance of all codes discussed in this paper

                        Athlon 64 2.0 GHz         Xeon EM64T 3.2 GHz
                        CPU time (s)  Speed-up    CPU time (s)  Speed-up
Baseline (List 1)       11.55         1           13.7          1
C-level (List 3)         8.07         1.43        13.2          1.03
  with fast sqrt         8.83         1.31        11.8          1.16
Vectorized (List 9)      8.55         1.35         9.03         1.52
  with fast sqrt         6.39         1.81         5.31         2.58
Mixed prec. (List 17)    3.96         2.92         3.48         3.96

In both cases, we have achieved quite remarkable speed-ups. The best speed is achieved with the SSE2/SSE mixed-precision code (List 17) on both platforms, and in this case the speed-up on the Intel Xeon is larger than that on the AMD Athlon 64.

6. Discussion and summary

In this paper, we have described in detail various ways of improving the performance of the force-calculation loop for gravitational interactions between particles. Since modern microprocessors have many instructions which cannot easily be exploited by existing compilers, we can achieve quite a significant performance improvement by writing a few small libraries in assembly language and/or using the instruction-set-specific extensions that the compiler offers.

Our implementation will give a significant speed-up for almost any N-body integration program. In addition, we believe that similar optimizations are possible in many other compute-intensive applications, within astrophysics as well as in other areas of physics and science in general.

The source code and documentation are available at:http://grape.astron.s.u-tokyo.ac.jp/~nitadori/phantom/

Acknowledgements

We thank Jumpei Niwa for making his Opteron machines available for the development of the optimized force loop, and also for his helpful advice. We are also grateful to Eiichiro Kokubo and Toshiyuki Fukushige for their encouragement during the development of the code. We thank the referee of this paper for helpful comments. We thank all of those who have been involved in the GRAPE project, which has given us many hints for speeding up the gravity calculation. This work was supported in part by the Grants-in-Aid of the Ministry of Education, Science, Culture, and Sport (14079205, 16340057 S.M.), and by the Grant-in-Aid for the 21st Century COE "Center for Diversity and Universality in Physics" from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. This research was supported in part by the Special Coordination Fund for Promoting Science and Technology (GRAPE-DR project) from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

Appendix A. Supplementary data

The online version contains the full assembly-language listings for Lists 2 and 3, and also a sample N-body code with a Hermite individual-timestep scheme. Supplementary data associated with this article can be found in the online version, at doi:10.1016/j.newast.2006.07.007.

References

Aarseth, S.J., 1963. MNRAS 126, 223.
Aarseth, S.J., 2003. Gravitational N-body Simulations. Cambridge University Press, Cambridge, pp. 18-23.
AMD, 2004. Software Optimization Guide for AMD Athlon 64 and AMD Opteron Processors (Publication #25112, November 2004).
Barnes, J., Hut, P., 1986. Nature 324, 446.
Free Software Foundation Inc., 2005. Online Manual of GCC 3.4.4. Available from: <http://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/>.
Fukushige, T., Makino, J., Kawai, A., 2005. PASJ 57, 1009.
Hamada, T., Fukushige, T., Makino, J., 2005. PASJ 57, 799.
Intel, 2006. IA-32 Intel Architecture Optimization Reference Manual (Publication #248966-011).
Karp, A.H., 1993. Scientific Programming 1, 133.
Koma, M., 2005. Available from: <hep-lat/0510029>.
Makino, J., Aarseth, S.J., 1992. PASJ 44, 141.
Makino, J., Fukushige, T., 2001. In: The SC2001 Proceedings, CD-ROM. IEEE Comp. Soc., Los Alamitos.
Makino, J., Fukushige, T., Koga, M., Namura, K., 2003. PASJ 55, 1163.
Sugimoto, D., Chikada, Y., Makino, J., Ito, T., Ebisuzaki, T., Umemura, M., 1990. Nature 345, 33.
van der Spoel, D., Lindahl, E., Hess, B., Groenhof, G., Mark, A.E., Berendsen, H.J.C., 2005. J. Comput. Chem. 26, 1701.
Warren, M.S., Salmon, J.K., Becker, D.J., Goda, M.P., Sterling, T., 1997. In: The SC97 Proceedings, CD-ROM. IEEE, Los Alamitos, CA.

Glossary

AMD: Advanced Micro Devices, the name of a semiconductor manufacturer.
AMD64: The name of AMD's 64-bit processors with backward compatibility to Intel's x86 processors.
EM64T: Intel's 64-bit processors with compatibility to AMD64 processors.
GCC: GNU Compiler Collection. A widely used compiler for C and other languages.
SIMD: Single-instruction multiple-data. A class of processor architecture in which multiple data are processed in the same way by a single instruction.
SSE: Streaming SIMD Extensions. Intel's 4-way SIMD parallel instruction set for 32-bit floating-point and integer data.
SSE2: Intel's 2-way SIMD parallel instruction set for 64-bit floating-point data.
SSE3: Some additions to SSE and SSE2.
XMM: The name of the registers used with the SSE/SSE2/SSE3 instruction sets.
x86: Generic name for Intel's microprocessor architecture, originated with the Intel i8086 16-bit processor.
x86-64: Generic name for AMD64 and EM64T.
x87: Generic name for Intel's floating-point instruction set, originally supported on the i8087 math coprocessor for the i8086 microprocessor.