
The International Journal of High Performance Computing Applications, Volume 22, No. 1, Spring 2008, pp. 113–128. DOI: 10.1177/1094342007085015. © 2008 SAGE Publications, Los Angeles, London, New Delhi and Singapore. Figures 3 and 7 appear in color online: http://hpc.sagepub.com

DE NOVO ULTRASCALE ATOMISTIC SIMULATIONS ON HIGH-END PARALLEL SUPERCOMPUTERS

Aiichiro Nakano1, Rajiv K. Kalia1, Ken-ichi Nomura1, Ashish Sharma1,2, Priya Vashishta1, Fuyuki Shimojo1,3, Adri C. T. van Duin4, William A. Goddard, III4, Rupak Biswas5, Deepak Srivastava5, Lin H. Yang6

1 Collaboratory for Advanced Computing and Simulations, University of Southern California, Los Angeles, CA 90089-0242, USA ([email protected])
2 Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210, USA
3 Department of Physics, Kumamoto University, Kumamoto 860-8555, Japan
4 Materials and Process Simulation Center, California Institute of Technology, Pasadena, CA 91125, USA
5 NASA Advanced Supercomputing (NAS) Division, NASA Ames Research Center, Moffett Field, CA 94035, USA
6 Physics/H Division, Lawrence Livermore National Laboratory, Livermore, CA 94551, USA

Abstract

We present a de novo hierarchical simulation framework for first-principles based predictive simulations of materials and their validation on high-end parallel supercomputers and geographically distributed clusters. In this framework, high-end chemically reactive and non-reactive molecular dynamics (MD) simulations explore a wide solution space to discover microscopic mechanisms that govern macroscopic material properties, into which highly accurate quantum mechanical (QM) simulations are embedded to validate the discovered mechanisms and quantify the uncertainty of the solution. The framework includes an embedded divide-and-conquer (EDC) algorithmic framework for the design of linear-scaling simulation algorithms with minimal bandwidth complexity and tight error control. The EDC framework also enables adaptive hierarchical simulation with automated model transitioning assisted by graph-based event tracking. A tunable hierarchical cellular decomposition parallelization framework then maps the O(N) EDC algorithms onto petaflops computers, achieving performance tunability through a hierarchy of parameterized cell data/computation structures and through its implementation using hybrid grid remote procedure call + message passing + threads programming. High-end computing platforms such as the IBM BlueGene/L, the SGI Altix 3000 and the NSF TeraGrid provide excellent test grounds for the framework. On these platforms, we have achieved unprecedented scales of quantum-mechanically accurate and well validated, chemically reactive atomistic simulations—1.06 billion-atom fast reactive force-field MD and 11.8 million-atom (1.04 trillion grid points) quantum-mechanical MD in the framework of the EDC density functional theory on adaptive multigrids—in addition to 134 billion-atom non-reactive space–time multiresolution MD, with parallel efficiency as high as 0.998 on 65,536 dual-processor BlueGene/L nodes. We have also achieved an automated execution of hierarchical QM/MD simulation on a grid consisting of 6 supercomputer centers in the US and Japan (a total of 150,000 processor-hours), in which the number of processors changes dynamically on demand and resources are allocated and migrated dynamically in response to faults. Furthermore, performance portability has been demonstrated on a wide range of platforms such as BlueGene/L, Altix 3000, and AMD Opteron-based Linux clusters.

Key words: hierarchical simulation, molecular dynamics, reactive force field, quantum mechanics, density functional theory, parallel computing, grid computing

1 Introduction

Petaflops computers (Dongarra and Walker 2001) to be built in the near future and grids (Allen et al. 2001; Kikuchi et al. 2002; Foster and Kesselman 2003) on geographically distributed parallel supercomputers will offer tremendous opportunities for high-end computational sciences. Their computing power will enable unprecedented scales of first-principles based predictive simulations to quantitatively study system-level behavior of complex dynamic systems (Emmott and Rison 2006). An example is the understanding of microscopic mechanisms that govern macroscopic materials behavior, thereby enabling rational design of material compositions and microstructures to produce desired material properties.

The multitude of length and time scales and the associated wide solution space have thus far precluded such first-principles approaches. A promising approach is hierarchical simulation (Broughton et al. 1999; Nakano et al. 2001; Ogata et al. 2001), in which atomistic molecular dynamics (MD) simulations (Kale et al. 1999; Abraham et al. 2002; Kadau et al. 2002; Nakano et al. 2002) of varying accuracy and computational cost (from classical non-reactive MD to chemically reactive MD based on semi-classical approaches) explore a wide solution space to discover new mechanisms, in which highly accurate quantum mechanical (QM) simulations (Car and Parrinello 1985; Kendall et al. 2000; Truhlar and McKoy 2000; Gygi et al. 2005; Ikegami et al. 2005) are embedded to validate the discovered mechanisms and quantify the uncertainty of the solution.

A simple estimate indicates that a 100 billion-atom MD simulation for one nanosecond (or 500,000 time steps), which embeds 200 million-atom reactive MD and 1 million-atom QM simulations, will require 45 days of computing on a petaflops platform. To enable such ultrascale simulations in the near future, however, nontrivial developments in algorithmic and computing techniques, as well as thorough scalability tests, are required today.
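As a rough sanity check on the arithmetic behind such an estimate, the sketch below multiplies out atom counts, time steps, and machine speed; the per-atom-per-step flop counts are illustrative assumptions (they are not given in the paper), so the result only indicates the order of magnitude.

# Hedged back-of-envelope estimate of the wall-clock time quoted in the text.
# Only the atom counts, step count, and machine speed come from the text; the
# flops-per-atom-per-step figures are illustrative assumptions.

PEAK_FLOPS = 1.0e15          # one petaflops platform
STEPS      = 500_000         # one nanosecond of simulated time

# (atoms, assumed flops per atom per step)
workloads = {
    "non-reactive MD": (100e9, 5.0e3),
    "reactive MD":     (200e6, 1.0e6),
    "QM (DFT) MD":     (1e6,   1.0e9),
}

total_flops = sum(n * flops for n, flops in workloads.values()) * STEPS
days = total_flops / PEAK_FLOPS / 86_400
# prints a figure of order ten days at theoretical peak; sustained performance
# well below peak brings the estimate to the tens of days quoted in the text
print(f"estimated wall-clock time at peak: {days:.0f} days")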

We are developing a de novo hierarchical simulation framework to enable first-principles based hierarchical simulations of materials and their validation on petaflops computers and grids. The framework includes:

• An embedded divide-and-conquer (EDC) algorithmic framework for: 1) the design of linear-scaling algorithms for approximate solutions of hard simulation problems with minimal bandwidth complexity and codified tight error control; and 2) adaptive embedding of QM simulations in MD simulation so as to guarantee the quality of the overall solution, where graph-based event tracking (i.e. shortest-path circuit analysis of the topology of chemical bond networks) automates the embedding upon the violation of error tolerance.

• A tunable hierarchical cellular decomposition (HCD) parallelization framework for: 1) mapping the linear-scaling EDC algorithms onto petaflops computers, while achieving performance tunability through a hierarchy of parameterized cell data/computation structures; and 2) enabling tightly coupled computations of considerable scale and duration on distributed clusters, based on hybrid grid remote procedure call + message passing + threads programming to combine flexibility, fault tolerance, and scalability.

High-end parallel supercomputers such as the IBM BlueGene/L and SGI Altix 3000, as well as grid test-beds such as the NSF TeraGrid, are excellent test grounds for such scalable de novo hierarchical simulation technologies. This paper describes scalability tests of our hierarchical simulation framework on these platforms as well as its portability to other platforms such as AMD Opteron-based Linux clusters. In the next section, we describe the EDC algorithmic framework. Section 3 discusses the tunable HCD parallelization framework. Results of benchmark tests are given in Section 4, and Section 5 contains conclusions.

2 Embedded Divide-and-Conquer Algorithms for De Novo Hierarchical Simulations

Prerequisite to successful hierarchical simulations and validation at the petaflops scale are simulation algorithms that are scalable beyond 10^5 processors. We use a unified embedded divide-and-conquer (EDC) algorithmic framework based on data locality principles to design linear-scaling algorithms for broad scientific applications with tight error control (Shimojo et al. 2005; Nakano et al. 2006). In EDC algorithms, spatially localized sub-problems are solved in a global embedding field, which is efficiently computed with tree-based algorithms (Figure 1). Examples of the embedding field are: 1) the electrostatic field in molecular dynamics (MD) simulation (Nakano et al. 2002); 2) the self-consistent Kohn-Sham potential in the density functional theory (DFT) in quantum mechanical (QM) simulation (Shimojo et al. 2005); and 3) a coarser but less compute-intensive simulation method in hierarchical simulation (Nakano et al. 2001).
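At the finest level, the spatially localized sub-problems correspond to cells of atoms (the linked-list cells discussed in Section 3). A minimal sketch of such a cell decomposition is given below; the function name, box handling, and toy parameters are assumptions for illustration, not the production data structures.

import numpy as np

def build_cell_lists(positions, box, r_cut):
    """Bin atoms into linked-list cells of edge >= r_cut (minimal sketch).

    positions : (N, 3) array of coordinates in [0, box)
    Returns a dict mapping a 3-D cell index to the list of atom indices it
    contains; neighbor interactions then only need the 27 surrounding cells.
    """
    n_cells = np.maximum((box // r_cut).astype(int), 1)   # cells per dimension
    cell_len = box / n_cells
    cells = {}
    for i, r in enumerate(positions):
        idx = tuple((r // cell_len).astype(int) % n_cells)
        cells.setdefault(idx, []).append(i)
    return cells

# toy usage: 1000 random atoms in a 20 x 20 x 20 box with a 3.0 cutoff
rng = np.random.default_rng(0)
box = np.array([20.0, 20.0, 20.0])
cells = build_cell_lists(rng.uniform(0, 20, size=(1000, 3)), box, r_cut=3.0)
print(len(cells), "occupied cells")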

2.1 Linear-Scaling Molecular-Dynamics and Quantum-Mechanical Simulation Algorithms

In the past several years, we have used the EDC framework to develop a suite of linear-scaling MD algorithms, in which interatomic forces are computed with varying accuracy and complexity: 1) classical MD involving the formally O(N^2) N-body problem; 2) reactive force-field (ReaxFF) MD involving the O(N^3) variable N-charge problem; 3) quantum mechanical (QM) calculation based on the DFT to provide approximate solutions to the exponentially complex quantum N-body problem; and 4) adaptive hierarchical QM/MD simulations that embed highly accurate QM simulations in MD simulation only when and where high fidelity is required. The Appendix describes the three EDC simulation algorithms that are used in our adaptive hierarchical simulations:

• Algorithm 1—MRMD: space–time multiresolution molecular dynamics.

• Algorithm 2—F-ReaxFF: fast reactive force-field molecular dynamics.

• Algorithm 3—EDC-DFT: embedded divide-and-conquer density functional theory on multigrids for quantum-mechanical molecular dynamics.


2.2 Controlled Errors of Embedded Divide-and-Conquer Simulation Algorithms

A major advantage of the EDC simulation algorithms for automated hierarchical simulations and validation is the ease of codifying (i.e. turning into a coded representation, in terms of programs, which is mechanically executable by other program components) error management. The EDC algorithms have a well-defined set of localization parameters, with which the computational cost and the accuracy are controlled. Figures 2(a) and 2(b) show the rapid convergence of the EDC-DFT energy as a function of its localization parameters (the size of a domain and the length of buffer layers that augment each domain to avoid artificial boundary effects). The EDC-DFT MD algorithm has also overcome the energy drift problem, which plagues most O(N) DFT-based MD algorithms, especially with large basis sets (> 10^4 unknowns per electron, necessary for the transferability of accuracy; see Figure 2(c); Shimojo et al. 2005).

Fig. 1 (a) Schematic of embedded divide-and-conquer (EDC) algorithms. The physical space is subdivided into spatially localized cells, with local atoms constituting sub-problems (bottom), which are embedded in a global field (shaded) solved with a tree-based algorithm. (b) In tunable hierarchical cellular decomposition (HCD), the physical volume is subdivided into process groups, PGγ, each of which is spatially decomposed into processes, Pγ^π. Each process consists of a number of computational cells (e.g. linked-list cells in MD or domains in EDC-DFT) of size l_cell, which are traversed concurrently by threads (denoted by dots with arrows) to compute blocks of cells. Pγ^π is augmented with n_layer layers of cached cells from neighbor processes.

Fig. 2 Controlled convergence of the potential energy of amorphous CdSe by localization parameters: (a) domain size (with the buffer size fixed at 2.854 a.u.); (b) buffer length (with the domain size fixed at 11.416 a.u.). Numerals are the number of self-consistent iterations required for the convergence of the electron density within 10^-4 of the bulk density. (c) Energy conservation in EDC-DFT based MD simulation of liquid Rb at 1400 K. The domain and buffer sizes are 16.424 and 8.212 a.u., respectively.


2.3 Adaptive Hierarchical Simulation Framework

Adaptive hierarchical simulation combines a hierarchy of MD algorithms (e.g. MRMD, F-ReaxFF, and EDC-DFT described above) to enable atomistic simulations that are otherwise too large to solve, while retaining QM accuracy (Broughton et al. 1999; Nakano et al. 2001; Ogata et al. 2001, 2004). The EDC framework achieves this by using a less compute-intensive coarse simulation as the embedding field. We have developed an adaptive EDC hierarchical simulation framework, which embeds accurate but compute-intensive simulations in coarse simulation only when and where high fidelity is required. The hierarchical simulation framework consists of: 1) hierarchical division of the physical system into subsystems of decreasing size and increasing quality-of-solution (QoS) requirements, S0 ⊃ S1 ⊃ … ⊃ Sn; and 2) a suite of simulation services Mα (α = 0, 1, …, n) of ascending order of accuracy (e.g. MRMD < F-ReaxFF < EDC-DFT). In the additive hybridization scheme, an accurate estimate of the energy of the entire system is obtained from the recurrence relation (Dapprich et al. 1999; Ogata et al. 2001),

Eα(Si) = Eα−1(Si) + Eα(Si+1) − Eα−1(Si+1) = Eα−1(Si) + δEα/α−1(Si+1).   (1)

This modular additive hybridization scheme not only allows the reuse of existing simulation codes but also minimizes the interdependence and communication between simulation modules. To further expose the data locality of hybrid QM/MD simulation, the EDC framework embeds EDC-DFT simulations of a number of domains within MD simulation of the total system:

E(total) = EMD(total) + Σdomain [EQM(domain) − EMD(domain)] = EMD(total) + Σdomain δEQM/MD(domain).   (2)
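A minimal sketch of the additive hybridization of equation (2) is shown below; the callables e_md and e_qm are hypothetical stand-ins for full MD and EDC-DFT energy evaluations, and the numbers are arbitrary.

# Minimal sketch of the additive hybridization of equation (2): hypothetical
# energy callables e_md() and e_qm() stand in for full MD and EDC-DFT codes.

def hybrid_energy(total_system, qm_domains, e_md, e_qm):
    """E(total) = E_MD(total) + sum over domains of [E_QM(domain) - E_MD(domain)]."""
    energy = e_md(total_system)
    for domain in qm_domains:
        energy += e_qm(domain) - e_md(domain)   # delta E_QM/MD(domain)
    return energy

# toy usage with stand-in energy functions (numbers are arbitrary)
e_md = lambda atoms: -1.0 * len(atoms)
e_qm = lambda atoms: -1.05 * len(atoms)
system = list(range(1000))                      # 1000 "atoms"
domains = [system[0:50], system[500:550]]       # two embedded QM regions
print(hybrid_energy(system, domains, e_md, e_qm))   # -1005.0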

Traditionally, termination atoms are added to the QM and MD domains to minimize artificial boundary effects. However, the solution is sensitive to the choice of the termination atoms, and thus the domains need to be determined manually before the simulation (Ogata et al. 2001). The buffer-layer approach of the EDC-DFT algorithm considerably reduces this sensitivity, and accordingly we are among the first to automate the adaptive domain redefinition during simulations (Takemiya et al. 2006).

2.4 Graph-Based Event Tracking for Adaptive Hierarchical Simulations and Validation

An MD simulation method usually has a validation database that compares MD values with corresponding higher-accuracy QM values for a set of physical properties of selected atomic configurations (i.e. the training set). For example, the ReaxFF potential of RDX (1,3,5-trinitro-1,3,5-triazine) has an extensive training database of DFT calculations not only for 1,600 equilibrated molecular fragments but also for 40 key reaction pathways (i.e. chains of intermediate structures that interpolate the structures of reactant and product molecules; Strachan et al. 2003). Our QM/MD simulation ensures the overall accuracy by using MD simulation only within its range of validity. Our current adaptive hierarchical simulation manages the error based on a simple heuristic, i.e. the deviation of bond lengths from their equilibrium values, for which the MD interatomic potential has been trained (Takemiya et al. 2006). For tighter error management, we extend this approach by encoding the deviation of local topological structures of atoms from those in the training set, based on an abstraction of the structures as a graph and its shortest-path circuit analysis. Though the modeling error is readily quantified by |δEα/α−1(Si+1)| in equation (1) and can be reduced by incrementally enlarging the size of the high-accuracy subsystem once it is defined, the challenge here is to estimate the error in advance so as to minimize subsequent readjustments (if the subsystem is defined too small) or costly speculative high-accuracy simulations (if it is too large).
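The bond-length heuristic can be pictured as in the following sketch, which flags atoms whose bonds have drifted away from the lengths the force field was trained on; the tolerance, the equilibrium-length table, and the function name are illustrative assumptions.

# A hedged sketch of the simple error heuristic described above: atoms whose
# bonds deviate from the equilibrium lengths the force field was trained on
# are flagged, and the surrounding region becomes a candidate QM domain.
# The tolerance and equilibrium-length table are illustrative assumptions.

def flag_for_qm(bonds, lengths, equilibrium, tol=0.15):
    """Return atoms whose bond length deviates by more than tol (relative).

    bonds       : list of (i, j) atom-index pairs
    lengths     : list of current bond lengths, same order as bonds
    equilibrium : dict mapping (i, j) -> trained equilibrium length
    """
    flagged = set()
    for (i, j), l in zip(bonds, lengths):
        l0 = equilibrium[(i, j)]
        if abs(l - l0) / l0 > tol:
            flagged.update((i, j))
    return flagged

bonds = [(0, 1), (1, 2)]
print(flag_for_qm(bonds, [1.10, 1.60], {(0, 1): 1.09, (1, 2): 1.33}))  # {1, 2}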

We abstract the structure of a material as a graph G = (V, E), in which atoms constitute the set of vertices V, and the edge set E consists of chemical bonds. Bonds are defined between a pair of atoms for which Pauling's bond order is larger than a threshold value or the pair distance is less than a cutoff radius. Each vertex is labeled with its three-dimensional position and auxiliary annotations such as atomic species, and each bond with attributes such as bond length and chemical bond order. Given a vertex x and two of its neighbors w and y in G, a shortest-path circuit generated by the triplet (w, x, y) is any closed path that contains edges (w, x) and (x, y) and a shortest path between w and y in the graph G−x (Franzblau 1991). The shortest-path circuit analysis has been used successfully to characterize topological order of amorphous materials and to identify and track topological defects such as dislocations (Figure 3; Rino et al. 1993; Nakano, Kalia, and Vashishta 1999; Szlufarska, Nakano, and Vashishta 2005; Branicio et al. 2006). Efficient algorithms with near linear scaling are essential for the graph analysis to be embedded as part of the simulation in real time. We have developed a scalable parallel shortest-path circuit analysis algorithm with small memory usage, based on dual-tree expansion and spatial hash-function tagging (SHAFT; Zhang et al. 2006). SHAFT utilizes the vertex-position label to design a compact, collision-free hash table, thereby avoiding the degradation of cache utilization for large graphs.
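For illustration, the following is a plain breadth-first-search sketch of shortest-path circuit (ring) analysis on a small bond graph; it follows the (w, x, y) definition above but is not the parallel SHAFT algorithm.

from collections import deque
from itertools import combinations

# Minimal sketch of shortest-path circuit (ring) analysis on a bond graph.
# Vertices are atoms, edges are bonds; this is a plain BFS version, not the
# parallel SHAFT algorithm described in the text.

def shortest_path_len(adj, src, dst, excluded):
    """BFS path length from src to dst in the graph with `excluded` removed."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        v, d = queue.popleft()
        if v == dst:
            return d
        for u in adj[v]:
            if u != excluded and u not in seen:
                seen.add(u)
                queue.append((u, d + 1))
    return None

def ring_sizes(bonds):
    """Sizes of shortest-path circuits generated by every triplet (w, x, y)."""
    adj = {}
    for i, j in bonds:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    sizes = []
    for x, nbrs in adj.items():
        for w, y in combinations(sorted(nbrs), 2):
            d = shortest_path_len(adj, w, y, excluded=x)
            if d is not None:
                sizes.append(d + 2)   # close the path through edges (w, x), (x, y)
    return sizes

# a single 6-membered ring: every triplet closes a 6-circuit
hexagon = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(sorted(set(ring_sizes(hexagon))))   # [6]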

The graph abstraction is also used to track discrete events to automate the discovery of mechanisms, i.e. cause-effect relations on a sequence of well-delineated microscopic events that govern system-level macroscopic behavior. An example is a damage mechanism recently discovered by our MD simulation, in which the intersection of an elastic shock-wave front (detected as a discrete jump in pressure) and a high-pressure structural transformation front (the boundary between a sub-graph of vertex degree 4 and one of vertex degree 6) nucleates topological defects and eventually causes the fracture of ceramic under impact (Branicio et al. 2006). Large classical MD simulations involving multibillion atoms are performed to search for events within a wide solution space, and only when distinct events are detected by the graph analysis are QM simulations invoked, and only where the local topological anomaly has been detected.

3 Tunable Hierarchical Cellular Decomposition Parallelization Framework

Data locality principles are key to developing a scalable parallel computing framework as well. We have developed a tunable hierarchical cellular decomposition (HCD) framework for mapping the O(N) EDC algorithms onto massively parallel computers with deep memory hierarchies. The HCD maximally exposes data locality and exploits parallelism at multiple decomposition levels, while providing performance tunability (Whaley, Petitet, and Dongarra 2001) through a hierarchy of parameterized cell data/computation structures (Figure 1b). At the finest level, the EDC algorithms consist of computational cells—linked-list cells (which are identical to the octree leaf cells in the fast multipole method; Greengard and Rokhlin 1987) in MRMD and F-ReaxFF (Nakano et al. 2002), or domains in EDC-DFT (Shimojo et al. 2005). In the HCD framework, each compute node (often comprising multiple processors with shared memory) of a parallel computer is identified as a subsystem (Pγ^π in Figure 1b) in spatial decomposition, which contains a large number of computational cells. Our EDC algorithms are implemented as hybrid message passing interface (MPI; Gropp, Lusk, and Skjellum 1999) + shared memory (OpenMP; Chandra et al. 2000) programs, in which inter-node communication for caching and migrating atoms between the Pγ^π's is handled with messages, whereas loops over cells within each Pγ^π (or MPI process) are parallelized with threads (denoted as dots with arrows in Figure 1b). To avoid performance-degrading critical sections, the threads are ordered by blocking cells, so that the atomic n-tuples being processed by the threads share no common atom. On top of computational cells, cell blocks, and spatial-decomposition subsystems, the HCD framework introduces a coarser level of decomposition by defining process groups (PGγ = ∪π Pγ^π in Figure 1b) as MPI communicators (within a tightly coupled parallel computer) or grid remote procedure calls (Tanaka et al. 2003; on a grid of clusters).

Our programs are designed to minimize global operations across the PGγ's and to overlap computations with inter-group communications (Kikuchi et al. 2002). For example, the potential energy is computed locally within each group and the global sum is computed only when it needs to be reported to the user. Also, our spatial decomposition scheme splits the computations on each processor into those involving only interior linked-list cells and those involving boundary cells. The interior computation is then fully overlapped with the communication of the boundary data. The effect of these latency-hiding techniques on performance is most noticeable in grid environments, since the communication overhead is already very small on each parallel supercomputer, as shown in Section 4.

Fig. 3 Network of topological defects (or dislocations) during hypervelocity impact of aluminum nitride ceramic (Branicio et al. 2006). Only atoms that participate in non-6-membered circuits are visualized; the color represents the pressure value. (A perfect crystal has only 6-membered circuits.)

The cellular data structure offers an effective abstraction mechanism for performance optimization. We optimize both data layouts (atoms are sorted according to their cell indices and the linked lists) and computation layouts (force computations are re-ordered by traversing the cells according to a space-filling curve, a mapping from the 3-D space to a 1-D list). Cells are traversed along either a Morton curve (Figure 1b) or a Hilbert curve (Moon et al. 2001). In a multi-threading case, the Morton curve ensures maximal separation between the threads and thus eliminates critical sections. Furthermore, the cell size is made tunable to optimize the performance. There is also a trade-off between spatial-decomposition/message-passing and thread parallelisms in the hybrid MPI+OpenMP programs (Henty 2000; Shan et al. 2002, 2003). While spatial decomposition involves extra computation on cached cells from neighbor subsystems, its disjoint memory subspaces are free from shared-memory protocol overhead. The computational cells are also used in our multilayer cellular decomposition scheme for inter-node caching of atomic n-tuple (n = 2–6) information (Figure 1b), where n changes dynamically in the MTS or MPCG algorithm (see Appendix). The Morton curve also facilitates a data compression algorithm based on data locality to reduce the I/O. The algorithm uses octree indexing and sorts atoms accordingly on the resulting Morton curve (Omeltchenko et al. 2000). By storing differences between successive atomic coordinates, the I/O requirement for a given error tolerance level is reduced from O(N log N) to O(N). An adaptive, variable-length encoding scheme is used to make the scheme tolerant to outliers and optimized dynamically. An order-of-magnitude improvement in the I/O performance was achieved for MD data with a user-controlled error bound. The HCD framework includes a topology-preserving computational spatial decomposition scheme to minimize latency through structured message passing and load-imbalance/communication costs through a wavelet-based load-balancing scheme (Nakano and Campbell 1997; Nakano 1999).
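The two cell-based optimizations described above can be sketched as follows: a Morton (Z-order) key obtained by interleaving the bits of the three cell indices, and a delta encoding of the sorted keys in the spirit of the coordinate-difference compression. The 10-bit quantization and helper names are assumptions for illustration, not the production I/O scheme.

# A small sketch of ordering cells along a Morton (Z-order) curve and
# delta-encoding sorted values. Bit depth and helper names are illustrative.

def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three cell indices into one Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def delta_encode(sorted_ints):
    """Store differences between successive values instead of the values."""
    return [sorted_ints[0]] + [b - a for a, b in zip(sorted_ints, sorted_ints[1:])]

# sort quantized atom cells along the Morton curve, then delta-encode the keys
cells = [(3, 1, 0), (0, 0, 0), (1, 1, 1), (2, 0, 3)]
keys = sorted(morton_key(*c) for c in cells)
print(keys, delta_encode(keys))   # [0, 7, 11, 44] [0, 7, 4, 33]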

High-end hierarchical simulations often run on thousands of processors for months. Grids of geographically distributed parallel supercomputers could satisfy the need for such 'sustained' supercomputing. In collaboration with scientists at the National Institute for Advanced Industrial Science and Technology (AIST) and the Nagoya Institute of Technology in Japan, we have recently proposed a sustainable grid supercomputing paradigm, in which the supercomputers that constitute the grid change dynamically according to a reservation schedule as well as to faults (Takemiya et al. 2006). We use hybrid grid remote procedure call (GridRPC) + MPI programming to combine flexibility and scalability. GridRPC enables asynchronous, coarse-grained parallel tasking and hides the dynamic, insecure and unstable aspects of the grid from programmers (we have used the Ninf-G GridRPC system, http://ninf.apgrid.org), while MPI supports efficient parallel execution on clusters.

4 Performance Tests

Scalability tests of the three parallel simulation algorithms, MRMD, F-ReaxFF and EDC-DFT, have been performed on a wide range of platforms, including the 10,240-processor SGI Altix 3000 at the NASA Ames Research Center, the 131,072-processor IBM BlueGene/L at the Lawrence Livermore National Laboratory (LLNL), and the 2,048-processor AMD Opteron-based Linux cluster at the University of Southern California (USC). We have also tested our sustainable grid supercomputing framework for hierarchical QM/MD simulations on a grid of 6 supercomputer centers in the US and Japan. The codes have been ported without any modification to all the platforms, except that only the pure MPI implementations have been run on BlueGene/L, since it does not support OpenMP.

4.1 Experimental Platforms

The SGI Altix 3000 system, named Columbia, at NASA Ames uses the NUMAflex global shared-memory architecture, which packages processors, memory, I/O, interconnect, graphics, and storage into modular components called bricks (detailed information is found at http://www.nas.nasa.gov/Resources/Systems/columbia.html). The computational building block of the Altix is the C-Brick, which consists of four Intel Itanium2 processors (in two nodes), local memory, and a two-controller application-specific integrated circuit called the Scalable Hub (SHUB). Each SHUB interfaces to the two CPUs within one node, along with memory, I/O devices, and other SHUBs. The Altix cache-coherency protocol implemented in the SHUB integrates the snooping operations of the Itanium2 and the directory-based scheme used across the NUMAflex interconnection fabric. A load/store cache miss causes the data to be communicated via the SHUB at cache-line granularity and automatically replicated in the local cache.


The 64-bit Itanium2 processor operates at 1.5 GHz and is capable of issuing two multiply-add operations per cycle, for a peak performance of 6 Gflops. The memory hierarchy consists of 128 floating-point registers and three on-chip data caches (32 KB L1, 256 KB L2, and 6 MB L3). The Itanium2 cannot store floating-point data in L1, making register loads and spills a potential source of bottlenecks; however, a relatively large register set helps mitigate this issue. The processor implements the Explicitly Parallel Instruction Computing (EPIC) technology, where instructions are organized into 128-bit VLIW bundles. The Altix platform uses the NUMAlink3 interconnect, a high-performance custom network in a fat-tree topology, in which the bisection bandwidth scales linearly with the number of processors. Columbia runs 64-bit Linux version 2.4.21. Our experiments use a 6.4 TB parallel XFS file system with a 35-fiber optical channel connection to the CPUs.

Columbia is configured as a cluster of 20 Altix boxes, each with 512 processors and 1 TB of global shared-access memory. Of these 20 boxes, 12 are model 3700 and the remaining eight are BX2—a double-density version of the 3700. Four of the BX2 boxes are linked with NUMAlink4 technology to allow the global shared-memory constructs to significantly reduce inter-processor communication latency. This 2,048-processor subsystem within Columbia provides a 13 Tflops peak-capability platform, and was the basis of the computations reported here.

The BlueGene/L (http://www.llnl.gov/ASC/platforms/bluegenel) has been developed by IBM and LLNL, and it uses a large number of low-power processors coupled with powerful interconnects and communication schemes. The BlueGene/L comprises 65,536 compute nodes (CN), each with two IBM PowerPC 440 processors (at 700 MHz clock speed) and 512 MB of shared memory. The theoretical peak performance is 5.6 Gflops per CN, or 367 Tflops for the full machine. Each processor has a 32 KB instruction and data cache, a 2 MB L2 cache and a 4 MB L3 cache, which is shared with the other processor on the CN. Each CN has two floating-point units that can perform fused multiply-add operations. In its default mode (co-processor mode), one of the processors in the CN manages the computation, while the other processor manages the communication. In an alternative mode of operation (virtual mode), both processors can be used for computation. The machine uses a highly optimized lightweight Linux distribution and does not allow access to individual nodes.

The nodes are interconnected through multiple complementary high-speed, low-latency networks, including a 3-D torus network and a tree network. The CNs are connected as a 64 × 32 × 32 3-D torus, which is used for common inter-processor communications. The tree network is optimized for collective operations such as broadcast and gather. The point-to-point bandwidth is 175 MB/s per link for the 3-D torus network and 350 MB/s for the tree network.

The Opteron cluster used for the scalability test consists of 512 nodes, each with two dual-core AMD Opteron processors (at 2 GHz clock speed) and 4 GB of memory (2,048 cores in total), which is part of a 5,472-processor Linux cluster at USC (http://www.usc.edu/hpcc/systems/l-overview.php). Each core has a 64 KB instruction cache, a 64 KB data cache, and a 1 MB L2 cache. A front-side bus operating at 2 GHz provides a maximum I/O bandwidth of 24 GB/s. The floating-point part of the processor contains three units: a Floating Store unit that stores results to the Load/Store Queue Unit, and Floating Add and Multiply units that can work in superscalar mode, resulting in two floating-point results per clock cycle. The 512 nodes are interconnected with Myrinet, which provides over 200 MB/s in a ping-pong experiment. Besides the high bandwidth, an advantage of Myrinet is that it operates entirely in user space, thus avoiding operating-system interference and associated delays. This reduces the latency for small messages to 15 µs.

4.2 Scalability Test Results

We have first tested the trade-off between MPI and OpenMP parallelisms on various shared-memory architectures. For example, we have compared different combinations of the number of OpenMP threads per MPI process, ntd, and the number of MPI processes, np, while keeping P = ntd × np constant, on P = 8 processors in an 8-way 1.5 GHz Power4 node. The optimal combination (ntd, np) with the minimum execution time is (1, 8) for the MRMD algorithm for an 8,232,000-atom silica material and (4, 2) for the F-ReaxFF algorithm for a 290,304-atom RDX crystal. Since BlueGene/L does not support OpenMP, we use ntd = 1 in the following performance comparisons.

Figure 4(a) shows the execution time of the MRMD algorithm for silica material as a function of the number of processors P on Columbia. We scale the problem size linearly with the number of processors, so that the number of atoms N = 1,029,000P (P = 1, …, 1920). In the MRMD algorithm, the interatomic potential energy is split into long-range and short-range contributions, and the long-range contribution is computed every 10 MD time steps. The execution time increases only slightly as a function of P, which signifies an excellent parallel efficiency. We define the speed of an MD program as the product of the total number of atoms and the number of time steps executed per second. The constant-granularity speedup is the ratio between the speed on P processors and that on one processor. The parallel efficiency is the speedup divided by P. On 1920 processors, the isogranular parallel efficiency of the MRMD algorithm is 0.87. A better measure of the inter-box scaling efficiency based on NUMAlink4 is the speedup from 480 processors in 1 box to 1920 processors in 4 boxes, divided by the number of boxes. On 1920 processors, the inter-box scaling efficiency is 0.977. The algorithm also involves very small communication time; see Figure 4(a).
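The speed, speedup, and efficiency definitions above translate directly into a few lines of code; the per-step timings in the sketch below are toy numbers chosen only to illustrate the calculation, not measurements from the paper.

# Hedged sketch of the metrics defined in the text: MD "speed" is atoms x steps
# executed per second, and isogranular (constant-granularity) efficiency is the
# P-processor speedup over one processor, divided by P.

def md_speed(n_atoms, n_steps, seconds):
    return n_atoms * n_steps / seconds

def isogranular_efficiency(speed_p, speed_1, p):
    return (speed_p / speed_1) / p

# toy numbers (not measurements): N/P = 1,029,000 atoms, 100 steps, with the
# per-step time growing slightly between 1 and 1920 processors
s1    = md_speed(1_029_000,        100, 100 * 5.0)
s1920 = md_speed(1_029_000 * 1920, 100, 100 * 5.75)
print(f"efficiency = {isogranular_efficiency(s1920, s1, 1920):.2f}")  # ~0.87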

Figure 4(b) shows the execution time of the F-ReaxFF MD algorithm for RDX material as a function of P, where the number of atoms is N = 36,288P. The computation time includes 3 conjugate gradient (CG) iterations to solve the electronegativity equalization problem for determining atomic charges at each MD time step. On 1,920 processors, the isogranular parallel efficiency of the F-ReaxFF algorithm is 0.953 and the inter-box scaling efficiency is 0.995.

Figure 4(c) shows the performance of the EDC-DFT based MD algorithm for 720P-atom alumina systems. In the EDC-DFT calculations, each domain of size 6.66 × 5.76 × 6.06 Å^3 contains 40 electronic wave functions, where each wave function is represented on 28^3 = 21,952 grid points. The execution time includes 3 self-consistent (SC) iterations to determine the electronic wave functions and the Kohn-Sham potential, with 3 CG iterations per SC cycle to refine each wave function iteratively. The largest calculation on 1,920 processors involves 1,382,400 atoms, for which the isogranular parallel efficiency is 0.907 and the inter-box scaling efficiency is 0.966.

The largest benchmark tests on Columbia include 18,925,056,000-atom MRMD, 557,383,680-atom F-ReaxFF, and 1,382,400-atom (121,385,779,200 electronic degrees-of-freedom) EDC-DFT calculations. We have observed perfect linear scaling for all three algorithms, with prefactors spanning five orders of magnitude. The only exception is the F-ReaxFF algorithm below 100 million atoms, where the execution time scales even sublinearly. This is due to the decreasing communication overhead, which scales as O((N/P)^(-1/3)).

The EDC simulation algorithms are portable and have been run on various high-end computers including the IBM BlueGene/L and dual-core AMD Opteron. Since the EDC framework exposes maximal locality, the algorithms scale well consistently on all platforms. Figure 5 compares the isogranular parallel efficiency of the F-ReaxFF MD algorithm for RDX material on Altix 3000, BlueGene/L, and Opteron. In Figure 5(a) for Altix 3000, the granularity is the same as that in Figure 4(b). Figure 5(b) shows the parallel efficiency as a function of P on BlueGene/L, where the number of atoms is N = 36,288P. On 65,536 BlueGene/L nodes (the computation uses only one processor per node in the co-processor mode), the isogranular parallel efficiency of the F-ReaxFF algorithm is over 0.998. Figure 5(c) shows the parallel efficiency of F-ReaxFF on Opteron, where the number of atoms is N = 107,520P. The measurements on Opteron have been carried out on one core per CPU for P = 1 and two cores per CPU for the other cases. The inter-core communication overhead (between P = 1 and 2) is negligible. The intra-node bandwidth (P = 4) and network speed (P ≥ 8) affect the total execution time by 4–12%. The sharp drop of efficiency in Figure 5(c) above 1000 cores may be attributed to interference with other jobs on the Linux cluster, which was in general use during the non-dedicated scalability test. The parallel efficiency is high on all three platforms, with higher efficiency achieved on the platforms with higher communication-bandwidth/processor-speed ratios (in descending order, BlueGene/L > Altix 3000 > Opteron/Myrinet).

Fig. 4 Total execution (circles) and communication (squares) times per MD time step as a function of the number of processors P (= 1, …, 1920) of Columbia, for three MD simulation algorithms: (a) MRMD for 1,029,000P-atom silica systems; (b) F-ReaxFF MD for 36,288P-atom RDX systems; and (c) EDC-DFT MD for 720P-atom alumina systems.

Major design parameters for reactive and nonreactive MD simulations of materials include the number of atoms in the simulated material and the method used to compute interatomic forces (classically in MRMD, semi-empirically in F-ReaxFF MD, or quantum-mechanically in EDC-DFT MD). Figure 6 shows a design-space diagram for MD simulations on BlueGene/L, Altix 3000, and Opteron. The largest benchmark tests in this study include 133,982,846,976-atom MRMD, 1,056,964,608-atom F-ReaxFF, and 11,796,480-atom (1,035,825,315,840 electronic degrees-of-freedom) EDC-DFT calculations on 65,536 dual-processor BlueGene/L nodes.

Different characteristics of the MRMD, F-ReaxFF and EDC-DFT algorithms are reflected in their floating-point performances. The interatomic potential in MRMD is precomputed and tabulated as a function of the interatomic distance. The MRMD computation thus consists predominantly of table look-ups for atomic pairs and triplets. The F-ReaxFF algorithm, on the contrary, performs a large number of floating-point operations, but it involves more complex list management for atomic n-tuples (n = 2–6). In contrast to these particle-based algorithms, the EDC-DFT algorithm deals with wave functions on regular mesh points. In all three O(N) algorithms, however, the data layout and computations are highly irregular compared with their higher-complexity counterparts. The floating-point performances of the MRMD (N/P = 1,029,000), F-ReaxFF (N/P = 36,288) and EDC-DFT (N/P = 720) algorithms on 1,920 Itanium2 processors are 1.31, 1.07, and 1.49 Tflops, respectively, whereas the theoretical peak performance is 11.5 Tflops.

4.3 Grid Test Results

Using our sustainable grid supercomputing framework, we have achieved an automated execution of hierarchical QM/MD simulation on a grid consisting of 6 supercomputer centers in the US (USC and two NSF TeraGrid nodes at the Pittsburgh Supercomputing Center and the National Center for Supercomputing Applications) and Japan (AIST, University of Tokyo, and Tokyo Institute of Technology; Takemiya et al. 2006). The simulation was sustained autonomously on ~700 processors for 2 weeks, involving a total of 150,000 CPU-hours, where the number of processors changed dynamically on demand and resources were allocated and migrated dynamically according to both reservations and unexpected faults.

Fig. 5 Isogranular parallel efficiency of F-ReaxFF as a function of the number of processors P for RDX material: (a) the number of atoms per processor, N/P = 36,288 on Altix 3000 (P = 1, …, 1920); (b) N/P = 10,752 on BlueGene/L (P = 1, …, 65536); and (c) N/P = 107,520 on an AMD Opteron cluster (P = 1, …, 2000).

Fig. 6 Benchmark tests of reactive and nonreactive MD simulations on 1920 Itanium2 processors of the Altix 3000 at NASA (open symbols), 2000 Opteron processors at USC (solid symbols), and 65,536 dual-processor BlueGene/L nodes at LLNL (shaded symbols). The execution time per MD step is shown as a function of the number of atoms for: quantum-mechanical MD based on the embedded divide-and-conquer density functional theory (EDC-DFT, circles); fast reactive force-field MD (F-ReaxFF, squares); and nonreactive space–time multiresolution MD (MRMD, triangles). Lines show O(N) scaling.

5 Conclusions

We have motivated and supported the need for petaflops computing for advanced materials research, and have demonstrated that judicious use of divide-and-conquer algorithms and hierarchical parallelization frameworks should make these applications highly scalable on petaflops platforms. We have illustrated the value of this work with real-world experiments involving quantum-mechanical and molecular-dynamics simulations on high-end parallel supercomputers such as the SGI Altix 3000, IBM BlueGene/L and AMD Opteron clusters, as well as on a grid of globally distributed parallel supercomputers.

We are currently applying the de novo hierarchical simulation framework to study deformation and damage mechanisms of nanophase ceramics and nanoenergetic materials in harsh environments, thereby assisting the design of superhard, tough and damage-tolerant nanomaterials, as well as nanoenergetic materials with high specific impulse and reduced sensitivity.

One application is hypervelocity impact damage of advanced ceramics [aluminum nitride (AlN), silicon carbide, and alumina; Branicio et al. 2006], for which we have recently performed 500 million-atom MD simulations. The simulation has revealed atomistic mechanisms of fracture accompanying structural phase transformation in AlN under hypervelocity impact at 15 km/s. We are extending these classical MD simulations to include surface chemical reactions under the high temperatures and flow velocities relevant to micrometeorite impact damage to the thermal and radiation protection layers of aerospace vehicles, the understanding of which is essential for safer space flights.

Another application is the combustion of nanoenergetic materials. We have performed 1.3 million-atom F-ReaxFF MD simulations to study shock-initiated detonation of an RDX (1,3,5-trinitro-1,3,5-triazine, C3N6O6H6) matrix embedded with aluminum nanoparticles (n-Al) on 1024 dual-core Opteron processors at the Collaboratory for Advanced Computing and Simulations of USC (Figure 7). In the simulation, a 320 × 210 × 204 Å^3 RDX/n-Al composite is impacted by a plate at a velocity of 5 km/s. The simulation has revealed atomistic processes of shock compression and subsequent explosive reaction. Strong attractive forces between oxygen and aluminum atoms break N–O and N–N bonds in the RDX and, subsequently, the dissociated oxygen atoms and NO molecules oxidize Al, which has also been observed in our DFT-based MD simulation (Umezawa et al. 2006).

De novo hierarchical simulations also have broad applications in nanoelectronics (Ogata et al. 2004). The hybrid QM/MD simulation on the US-Japan grid described in Section 4 has studied the SIMOX (separation by implantation of oxygen) technique for fabricating high-speed, low-power-consumption semiconductor devices. The simulation of the implantation of oxygen atoms toward a Si substrate has revealed a strong dependence of the oxygen penetration depth on the incident beam position, which should be taken into consideration in extending the SIMOX technique to lower incident energies.

These applications on high-end computing platforms today are paving the way for predictive, first-principles simulation-based sciences in the coming years (Emmott and Rison 2006).

Appendix: Embedded Divide-and-Conquer Simulation Algorithms

This Appendix describes computational characteristics of the three embedded divide-and-conquer simulation algorithms that are used in our adaptive hierarchical simulations: 1) MRMD: space–time multiresolution molecular dynamics; 2) F-ReaxFF: fast reactive force-field molecular dynamics; and 3) EDC-DFT: embedded divide-and-conquer density functional theory on adaptive multigrids for quantum-mechanical molecular dynamics.

Fig. 7 F-ReaxFF MD simulation of n-Al/RDX shown on a tiled display at the Collaboratory for Advanced Computing and Simulations of USC. The visualization software, with embedded graph analysis algorithms (Zhang et al. 2006), has been developed by Sharma et al. (2003).


Algorithm 1—MRMD: Space–Time Multiresolution Molecular Dynamics

MRMD is used as a template for developing broad particle and continuum simulation algorithms. The MD approach follows the time evolution of the positions, rN = {ri | i = 1, …, N}, of N atoms by solving coupled ordinary differential equations (Nakano et al. 2002). The atomic force law is mathematically encoded in the interatomic potential energy EMD(rN), which is often an analytic function EMD({rij}, {rijk}) of atomic pair, rij, and triplet, rijk, positions. For the long-range electrostatic interaction, we use the fast multipole method (FMM) to reduce the O(N^2) computational complexity of the N-body problem to O(N) (Greengard and Rokhlin 1987). In the FMM, the physical system is recursively divided into subsystems to form an octree data structure, and the electrostatic field is computed recursively on the octree with O(N) operations, while maintaining spatial locality at each recursion level. Our scalable parallel implementation of the FMM has a unique feature to compute atomistic stress tensor components based on a complex charge method (Ogata et al. 2003). MRMD also utilizes temporal locality through multiple time stepping, which uses different force-update schedules for different force components (Martyna et al. 1994; Schlick et al. 1999; Nakano et al. 2002). Specifically, forces from neighbor atoms are computed at every MD step, whereas forces from farther atoms are updated less frequently. For parallelization, we use spatial decomposition: the total volume is divided into P subsystems of equal volume, and each subsystem is assigned to a node in an array of P compute nodes. To calculate the force on an atom in a subsystem, the coordinates of the atoms in the boundaries of neighbor subsystems are 'cached' from the corresponding nodes. After updating the atom positions due to time stepping, some atoms may have moved out of their subsystem; these atoms are 'migrated' to the proper neighbor nodes. With spatial decomposition, the computation scales as N/P, while communication scales as (N/P)^(2/3). The FMM incurs an O(log P) overhead, which is negligible for coarse-grained (N/P >> P) applications.
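A minimal sketch of the multiple-time-stepping idea is given below: short-range forces are refreshed every step and the long-range (FMM-like) contribution every n_long steps. The force routines are hypothetical stand-ins, and the scheme shown is a simplified refresh schedule rather than the exact MRMD integrator.

import numpy as np

# Sketch of multiple time stepping: short-range forces recomputed every step,
# long-range forces reused for n_long steps. Force routines are stand-ins.

def mts_velocity_verlet(r, v, short_force, long_force, dt, n_steps, n_long=10, mass=1.0):
    f_long = long_force(r)                     # expensive; reused for n_long steps
    f = short_force(r) + f_long
    for step in range(n_steps):
        v += 0.5 * dt * f / mass
        r += dt * v
        if step % n_long == n_long - 1:
            f_long = long_force(r)             # refresh the long-range field
        f = short_force(r) + f_long
        v += 0.5 * dt * f / mass
    return r, v

# toy usage: harmonic "short-range" force plus a weak constant "long-range" drift
r = np.zeros((4, 3)); v = np.ones((4, 3))
r, v = mts_velocity_verlet(r, v,
                           short_force=lambda x: -x,
                           long_force=lambda x: np.full_like(x, 0.01),
                           dt=0.01, n_steps=100)
print(r.shape, v.shape)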

Algorithm 2—F-ReaxFF: Fast Reactive Force-Field Molecular Dynamics

In the past 5 years, we have developed a first-principles-based reactive force-field (ReaxFF) approach to significantly reduce the computational cost of simulating chemical reactions (van Duin et al. 2001; Strachan et al. 2003). However, its parallelization has seen only limited success, with the previously largest ReaxFF MD involving N < 10^4 atoms. We have developed F-ReaxFF to enable ReaxFF MD involving 10^9 atoms (Nakano et al. 2006; Vashishta, Kalia, and Nakano 2006). The variable N-charge problem in ReaxFF amounts to solving a dense linear system of equations to determine the atomic charges {qi | i = 1, …, N} at every MD step (Rappe and Goddard 1991; Campbell et al. 1999). F-ReaxFF reduces its O(N^3) complexity to O(N) by combining the FMM, based on spatial locality, with iterative minimization that utilizes the temporal locality of the solution. To accelerate the convergence, we use a multilevel preconditioned conjugate-gradient (MPCG) method that splits the Coulomb-interaction matrix into short- and long-range parts and uses the sparse short-range matrix as a preconditioner (Nakano 1997). The extensive use of the sparse preconditioner enhances the data locality and thereby improves the parallel efficiency. The chemical bond order Bij is an attribute of the atomic pair (i, j) and changes dynamically, adapting to the local environment. In ReaxFF, the potential energy EReaxFF({rij}, {rijk}, {rijkl}, {qi}, {Bij}) between atomic pairs rij, triplets rijk, and quadruplets rijkl depends on the bond orders of all constituent atomic pairs. Force calculations in ReaxFF thus involve up to atomic 6-tuples due to chain-rule differentiations through Bij. To efficiently handle the multiple interaction ranges, the parallel F-ReaxFF algorithm employs a multilayer cellular decomposition scheme for caching atomic n-tuples (n = 2–6) (Nakano et al. 2006).
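The following is a generic preconditioned conjugate-gradient sketch in the spirit of MPCG, using a diagonal stand-in for the sparse short-range preconditioner on a toy symmetric positive-definite system; it is a textbook PCG, not the production F-ReaxFF charge solver.

import numpy as np

# Textbook preconditioned CG: solve A q = b with preconditioner M_inv, where A
# stands in for the Coulomb matrix and the preconditioner for its short-range part.

def pcg(A, b, M_inv, tol=1e-8, max_iter=200):
    q = np.zeros_like(b)
    r = b - A @ q
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        q += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return q

# toy symmetric positive-definite system with a diagonal preconditioner
rng = np.random.default_rng(1)
B = rng.normal(size=(50, 50))
A = B @ B.T + 50 * np.eye(50)
b = rng.normal(size=50)
q = pcg(A, b, M_inv=lambda r: r / np.diag(A))
print(np.linalg.norm(A @ q - b))    # small residual, ~1e-8 or below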

Algorithm 3—EDC-DFT: Embedded Divide-and-Conquer Density Functional Theory on Adaptive Multigrids for Quantum-Mechanical Molecular Dynamics

EDC-DFT describes chemical reactions with higher quantum-mechanical accuracy than ReaxFF. The DFT problem is formulated as a minimization of the energy functional E_QM(r^N, ψ^{N_el}) with respect to the electronic wave functions (or Kohn-Sham orbitals) ψ^{N_el}(r) = {ψ_n(r) | n = 1, …, N_el}, subject to orthonormality constraints (N_el is the number of wave functions, on the order of N; Hohenberg and Kohn 1964). The data-locality principle called quantum nearsightedness (Kohn 1996) in DFT is best implemented with a divide-and-conquer algorithm (Yang 1991; Yang and Lee 1995), which naturally leads to O(N) DFT calculations (Goedecker 1999). However, it is only in the past several years that O(N) DFT algorithms, especially with large basis sets (> 10^4 unknowns per electron, necessary for the transferability of accuracy), have attained controlled error bounds, robust convergence properties, and energy conservation during MD simulations, making large DFT-based MD simulations practical (Fattebert and Gygi 2004; Shimojo et al. 2005). We have designed an embedded divide-and-conquer density functional theory (EDC-DFT) algorithm, in which a hierarchical grid technique combines multigrid preconditioning and adaptive fine-mesh generation (Shimojo et al. 2005). The EDC-DFT algorithm represents the physical system as a union of overlapping spatial domains, Ω = ∪_α Ω_α, and physical properties are computed as linear combinations of domain properties. For example, the electronic density is expressed as ρ(r) = Σ_α p^α(r) Σ_n f_n^α |ψ_n^α(r)|^2, where p^α(r) is a support function that vanishes outside the α-th domain Ω_α, and f_n^α and ψ_n^α(r) are the occupation number and the wave function of the n-th Kohn-Sham orbital in Ω_α. The domains are embedded in a global Kohn-Sham potential, which is a functional of ρ(r) and is determined self-consistently with f_n^α and ψ_n^α(r). We use the multigrid method to compute the global potential in O(N) time. The DFT calculation in each domain is performed using a real-space approach (Chelikowsky et al. 2000), in which electronic wave functions are represented on grid points. The real-space grid is augmented with coarser multigrids to accelerate the iterative solution. Furthermore, a finer grid is adaptively generated near every atom in order to accurately operate ionic pseudopotentials for calculating electron-ion interactions. The EDC-DFT algorithm on the hierarchical real-space grids is implemented on parallel computers based on spatial decomposition. Each compute node contains one or more domains of the EDC algorithm; then only the global density, but not individual wave functions, needs to be communicated. The resulting large computation/communication ratio makes this approach highly scalable.
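The following toy example (our own 1D illustration under stated assumptions, not the EDC-DFT code: box-shaped support functions, Gaussian stand-ins for the domain Kohn-Sham orbitals, and made-up occupation numbers) shows how overlapping domain contributions, weighted by a partition-of-unity support function p^α(r), are summed to assemble the global electron density ρ(r).

```python
import numpy as np

grid = np.linspace(0.0, 10.0, 501)          # 1D real-space grid
dx = grid[1] - grid[0]

def box_support(lo, hi):
    """Crude support function: 1 inside the domain [lo, hi], 0 outside."""
    return ((grid >= lo) & (grid <= hi)).astype(float)

# Two overlapping domains with hypothetical orbitals and occupations.
domains = [
    {"p": box_support(0.0, 6.0),
     "orbitals": [np.exp(-(grid - 2.0) ** 2), np.exp(-(grid - 4.0) ** 2)],
     "occ": [2.0, 2.0]},
    {"p": box_support(4.0, 10.0),
     "orbitals": [np.exp(-(grid - 6.0) ** 2), np.exp(-(grid - 8.0) ** 2)],
     "occ": [2.0, 2.0]},
]

# Normalize the supports into a partition of unity so that the overlap
# region between domains is not double counted.
p_sum = sum(d["p"] for d in domains)
for d in domains:
    d["p"] = np.divide(d["p"], p_sum, out=np.zeros_like(p_sum), where=p_sum > 0)

# rho(r) = sum_alpha p_alpha(r) * sum_n f_n^alpha |psi_n^alpha(r)|^2
rho = np.zeros_like(grid)
for d in domains:
    local = sum(f * np.abs(psi) ** 2 for f, psi in zip(d["occ"], d["orbitals"]))
    rho += d["p"] * local

print("integrated density:", rho.sum() * dx)   # value depends on the toy orbitals
```

Because only ρ(r), and not the domain wave functions, enters the global Kohn-Sham potential, a parallel implementation needs to communicate only this density field between compute nodes, which is the basis of the large computation/communication ratio noted above.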

Acknowledgments

The work at the University of Southern California (USC) was partially supported by AFOSR-DURINT, ARO-MURI, DOE, DTRA, NSF, and Chevron-CiSoft. The work of LHY was supported under the auspices of the U.S. Department of Energy by the University of California Lawrence Livermore National Laboratory (LLNL) under contract No. W-7405-ENG-48. Benchmark tests were performed using the Columbia supercomputer at the NASA Ames Research Center, the BlueGene/L at LLNL, the NSF TeraGrid, the 5472-processor (15.8 Tflops) Linux cluster at USC, the 11 Tflops Itanium/Opteron cluster at the National Institute of Advanced Industrial Science and Technology (AIST), and Linux clusters at the University of Tokyo and the Tokyo Institute of Technology. Programs have been developed using the 2048-processor (4 Tflops) Opteron/Xeon/Apple G5 cluster at the Collaboratory for Advanced Computing and Simulations of USC. The authors thank Davin Chan, Johnny Chang, Bob Ciotti, Edward Hook, Art Lazanoff, Bron Nelson, Charles Niggley, and William Thigpen for technical discussions on Columbia; Shuji Ogata, Satoshi Sekiguchi, Hiroshi Takemiya, and Yoshio Tanaka for their collaboration on the adaptive QM/MD simulations on the US-Japan Grid; and Bhupesh Bansal, Paulo S. Branicio, and Cheng Zhang for their contributions to graph-based data mining.

Author Biographies

Aiichiro Nakano is a professor of computer science with joint appointments in physics & astronomy, chemical engineering & materials science, and the Collaboratory for Advanced Computing and Simulations at the University of Southern California. He received a Ph.D. in physics from the University of Tokyo, Japan, in 1989. He has authored 225 refereed articles, including 140 journal papers, in the areas of scalable scientific algorithms, grid computing on geographically distributed parallel computers, and scientific visualization. He is a recipient of: the National Science Foundation Career Award (1997); Louisiana State University (LSU) Alumni Association Faculty Excellence Award (1999); LSU College of Basic Sciences Award of Excellence in Graduate Teaching (2000); the Best Paper Award at the IEEE/ACM Supercomputing 2001 Conference; Best Paper at the IEEE Virtual Reality Conference (2002); and Okawa Foundation Faculty Research Award (2003). He is a member of IEEE, ACM, APS, and MRS.

Rajiv K. Kalia is a professor in the Department of Physics & Astronomy with joint appointments in chemical engineering & materials science, computer science, and the Collaboratory for Advanced Computing and Simulations. He graduated with a Ph.D. in physics from Northwestern University in 1976. His expertise is in the area of multiscale simulations involving atomistic, mesoscale, and continuum approaches on a GRID of distributed parallel supercomputers and immersive and interactive virtual environments. He has authored over 250 papers, which include ultrascale simulations of: sintering, crack growth, stress corrosion, nanoindentation, friction, and hypervelocity impact in ceramics, nanophase composites, and nanoscale devices; oxidation and structural transformations in metallic and semiconductor nanoparticles; and structure/dynamics of self-assembled monolayers. He is a recipient of: FOM Fellowship, the Netherlands (2000); LSU Distinguished Faculty Award (1999); Sustained Excellence Performance Award in Ultra Dense, Ultra Fast Computing Components, DARPA (1997); Japan Society for the Promotion of Science Fellowship Award (1991); and Brazilian Science Research Council Award (1986).

Ken-ichi Nomura is a Ph.D. candidate in the Department of Physics & Astronomy at the University of Southern California (USC). He received an M.S. in physics from Niigata University, Japan, in 2002 and an M.S. in computer science from USC in 2006. He has developed a scalable parallel algorithm for chemically reactive molecular-dynamics simulation and has performed multimillion-atom reactive MD simulations of nanostructured energetic materials. He has authored 14 papers.


Ashish Sharma is a postdoctoral researcher in the Department of Biomedical Informatics of the Ohio State University Medical Center. He received a Ph.D. in computer science from the University of Southern California, Los Angeles, CA, in 2005. He received an M.S. in computer science from Louisiana State University, Baton Rouge, LA, in 2002. His research interests are in scientific visualization, data mining, high performance computing, and biomedical imaging and grid technologies. He has published 14 papers, including 9 journal papers. He is a recipient of the USC Outstanding Student Research Award (2002) and the Best Paper at the IEEE Virtual Reality Conference (2002).

Priya Vashishta received a Ph.D. in physics from the Indian Institute of Technology, Kanpur, India, in 1967. He is the founding Director of the Collaboratory for Advanced Computing and Simulations, with multidisciplinary appointments in the School of Engineering and the College of Letters, Arts and Sciences, at the University of Southern California. Before joining USC, he was: Cray Research Professor of Computational Sciences at Louisiana State University, where he was the founding Director of the Concurrent Computing Laboratory for Materials Simulations; and Senior Scientist at the Argonne National Laboratory, where he was the Director of the Solid State Science Division from 1979 to 1982. His research interests include high performance computing and visualization to carry out large multiscale simulations of materials and processes, nanoscale systems, and the info-bio-nano interface on massively parallel and distributed computers. His awards and honors include: University of Chicago Award for Distinguished Performance at ANL (1976); Japan Society for the Promotion of Science Senior Fellowship (1985 and 1989); Brazilian Science Research Council Fellowship (1985); United Nations Development Program Fellowship (1990); Sustained Excellence Performer Award in Ultra Dense, Ultra Fast Computing Components, DARPA (1997); Fellow of the American Physical Society (1999); and IEEE/ACM Supercomputing Best Paper Award (2001).

Fuyuki Shimojo is an associate professor in the Department of Physics at Kumamoto University, Japan. He received a Ph.D. in physics from Niigata University, Japan, in 1993. His main research interests are condensed matter physics, computational materials science, and parallel algorithms for large-scale molecular simulations. He has authored 132 refereed articles, including 90 journal papers, in these areas. He is a member of APS and JPS.

Adrianus (Adri) C. T. van Duin is a Senior Research Fellow at the California Institute of Technology, working as Director of Force Field and Simulation Technology at the Materials and Process Simulation Center. He received a Ph.D. in chemistry from the Delft University of Technology, the Netherlands, in 1996. He has authored over 50 refereed journal papers. His main area of research involves the development and application of the ReaxFF reactive force field, which has been used to simulate chemical reactions in a wide range of materials (hydrocarbons, proteins, high-energy materials, metals, metal oxides/hydrides/carbides, and semiconductors). In addition, he has substantial experience in the field of organic geochemistry, including oil formation and organic/water/mineral interactions. He obtained Marie Curie and Royal Society Fellowships for his organic geochemical and force field development work.

William A. Goddard, III, is the Charles and Mary Ferkel Professor of Chemistry, Materials Science, and Applied Physics, and Director of the Materials and Process Simulation Center (MSC) at the California Institute of Technology (Caltech). He received his Ph.D. in engineering science (minor in physics) from Caltech in 1964 and has been on the faculty at Caltech since then. He is a member of the National Academy of Sciences (1984), a member of the International Academy of Quantum Molecular Science (1988), a fellow of the American Physical Society (1988), and a fellow of the American Association for the Advancement of Science (1990). He received the ACS Award for Computers in Chemistry (1988), the Richard M. Badger Teaching Prize in Chemistry, Caltech (1995), the Feynman Prize for Nanotechnology Theory (1999), the NASA Space Sciences Award (2000), and the Richard Chase Tolman Prize from the ACS (2000). He was named by ISI as a most highly cited chemist for 1981 to 1999 and was winner of the 2002 Prize in Computational Nanotechnology Design from the Institute for Molecular Manufacturing. He was a cofounder of Molecular Simulations Inc. (now named Accelrys) (1984), Schrödinger Inc. (1990), Systine Inc. (2001), Eidogen Inc. (now Eidogen-Sertanty) (2000), Allozyne Inc. (2004), and Qateomix Inc. (2005). His current research interests include: 1) new methodology for quantum chemistry, force fields, molecular dynamics, mesoscale dynamics, and statistical mechanics; 2) applications of atomistic simulations to chemical, biological, and materials systems, including catalysis, polymers, semiconductors, ceramics, and metal alloys; 3) protein structure prediction, drug design, and incorporation of non-natural amino acids; 4) industrial problems in catalysis, polymers, fuel cells, energetic materials, and drug design; and 5) computational nanotechnology with applications to nanoelectronics. He has over 687 research publications.

Rupak Biswas is the acting chief of the NASA Advanced Supercomputing (NAS) Division at Ames Research Center. In this capacity, he oversees the full range of high-performance computing operations, services, and R&D for NASA's primary supercomputing center. He received his Ph.D. in computer science from Rensselaer Polytechnic Institute in 1991, and has been with NASA ever since. He is an internationally recognized expert in parallel programming paradigms; benchmarking and performance characterization; partitioning and load balancing techniques; and scheduling algorithms. He has published more than 130 technical papers in journals and peer-reviewed conferences, and has received many NASA awards and two Best Paper prizes (SC99, SC2000). He served as a member of the inter-agency High End Computing Revitalization Task Force, mission partner representative on the DARPA High Productivity Computing Systems project, and one of five expert panelists to assess all high-end computing R&D activities in Japan, and is currently a Director on the OpenMP Architecture Review Board.

Deepak Srivastava is a senior scientist and group lead of computational nanotechnology investigations at the NASA Ames Research Center. He received a Ph.D. in physics from the University of Florida. His research interests focus on simulation of the properties of nanoscale materials. His accomplishments include: co-winner of the Feynman Prize in Nanotechnology (1997); NASA Ames CC Award (1998); Veridian Medals for Technical Paper (1999); NASA Group Excellence Award (2000); the Eric Reissner Medal (2002); and CSC Award for Technical Excellence (2003). Dr. Srivastava serves as associate editor of two peer-reviewed journals, Journal of Nanoscience and Nanotechnology and Computer Modeling in Engineering and Sciences, and as an Editorial Board Member of Composites Science and Technology. He co-founded a nanotechnology start-up company, which was bought by another start-up company, Nanostellar, which focuses on rational design and synthesis of nanomaterials for catalysis and the energy sector. He has chaired or co-chaired many conferences on nanotechnology and nanomaterials, and is a founding board member of the CANEUS (Canada-Europe-US) Organization for the promotion of nanotechnology for aerospace and defense.

Lin H. Yang is a physicist at Lawrence Livermore National Laboratory. He received a Ph.D. in physics from the University of California, Davis, in 1989. He has authored more than 80 refereed journal papers in the areas of large-scale molecular dynamics and quantum molecular dynamics simulations. His current research focus is to develop petascale quantum molecular dynamics simulation algorithms for metals. He is a member of the American Physical Society and currently serves as a Member-at-Large of the APS California Section.

References

Abraham, F. F., Walkup, R., Gao, H. J., Duchaineau, M., De la Rubia, T. D., and Seager, M. (2002). Simulating materials failure by using up to one billion atoms and the world's fastest computer: Brittle fracture, Proceedings of the National Academy of Sciences of the United States of America, 99(9): 5777–5782.

Allen, G., Dramlitsch, T., Foster, I., Goodale, T., Karonis, N., Ripeanu, M., Seidel, E., and Toonen, B. (2001). Supporting efficient execution in heterogeneous distributed computing environments with Cactus and Globus, In Proceedings of Supercomputing 2001, Denver: ACM.

Branicio, P. S., Kalia, R. K., Nakano, A., and Vashishta, P. (2006). Shock-induced structural phase transition, plasticity, and brittle cracks in aluminum nitride ceramic, Physical Review Letters, 96(6): 065502.

Broughton, J. Q., Abraham, F. F., Bernstein, N., and Kaxiras, E. (1999). Concurrent coupling of length scales: Methodology and application, Physical Review B, 60(4): 2391–2403.

Campbell, T. J., Kalia, R. K., Nakano, A., Vashishta, P., Ogata, S., and Rodgers, S. (1999). Dynamics of oxidation of aluminum nanoclusters using variable charge molecular-dynamics simulations on parallel computers, Physical Review Letters, 82(24): 4866–4869.

Car, R. and Parrinello, M. (1985). Unified approach for molecular dynamics and density functional theory, Physical Review Letters, 55: 2471–2474.

Chandra, R., Menon, R., Dagum, L., Kohr, D., Maydan, D., and McDonald, J. (2000). Parallel Programming in OpenMP. San Francisco: Morgan Kaufmann.

Chelikowsky, J. R., Saad, Y., Ögüt, S., Vasiliev, I., and Stathopoulos, A. (2000). Electronic structure methods for predicting the properties of materials: grids in space, Physica Status Solidi (b), 217: 173–195.

Dapprich, S., Komáromi, I., Byun, K. S., Morokuma, K., and Frisch, M. J. (1999). A new ONIOM implementation in Gaussian 98. I. The calculation of energies, gradients, vibrational frequencies, and electric field derivatives, J. Mol. Struct. (Theochem), 461–462: 1–21.

Dongarra, J. J. and Walker, D. W. (2001). The quest for petascale computing, Computing in Science and Engineering, 3(3): 32–39.

Emmott, S. and Rison, S. (2006). Towards 2020 Science, Cambridge, UK: Microsoft Research.

Fattebert, J.-L. and Gygi, F. (2004). Linear scaling first-principles molecular dynamics with controlled accuracy, Computer Physics Communications, 162(1): 24–36.

Foster, I. and Kesselman, C. (2003). The Grid 2: Blueprint for a New Computing Infrastructure: Morgan Kaufmann.

Franzblau, D. S. (1991). Computation of ring statistics for network models of solids, Physical Review B, 44(10): 4925–4930.

Goedecker, S. (1999). Linear scaling electronic structure methods, Reviews of Modern Physics, 71: 1085–1123.

Greengard, L. and Rokhlin, V. (1987). A fast algorithm for particle simulations, Journal of Computational Physics, 73: 325–348.


Gropp, W., Lusk, E., and Skjellum, A. (1999). Using MPI, 2nd edition. Cambridge, MA: MIT Press.

Gygi, F., Draeger, E., de Supinski, B. R., Yates, R. K., Franchetti, F., Kral, S., Lorenz, J., Ueberhuber, C. W., Gunnels, J. A., and Sexton, J. C. (2005). Large-scale first-principles molecular dynamics simulations on the BlueGene/L platform using the Qbox code, In Proceedings of Supercomputing 2005, Seattle: ACM.

Henty, D. S. (2000). Performance of hybrid message-passing and shared-memory parallelism for discrete element modeling, In Proceedings of Supercomputing 2000, Los Alamitos: IEEE.

Hohenberg, P. and Kohn, W. (1964). Inhomogeneous electron gas, Physical Review, 136: B864–B871.

Ikegami, T., Ishida, T., Fedorov, D. G., Kitaura, K., Inadomi, Y., Umeda, H., Yokokawa, M., and Sekiguchi, S. (2005). Full electron calculation beyond 20,000 atoms: ground electronic state of photosynthetic proteins, In Proceedings of Supercomputing 2005, Seattle: ACM.

Kadau, K., Germann, T. C., Lomdahl, P. S., and Holian, B. L. (2002). Microscopic view of structural phase transitions induced by shock waves, Science, 296(5573): 1681–1684.

Kale, L., Skeel, R., Bhandarkar, M., Brunner, R., Gursoy, A., Krawetz, N., Phillips, J., Shinozaki, A., Varadarajan, K., and Schulten, K. (1999). NAMD2: greater scalability for parallel molecular dynamics, Journal of Computational Physics, 151(1): 283–312.

Kendall, R. A., Apra, E., Bernholdt, D. E., Bylaska, E. J., Dupuis, M., Fann, G. I., Harrison, R. J., Ju, J. L., Nichols, J. A., Nieplocha, J., Straatsma, T. P., Windus, T. L., and Wong, A. T. (2000). High performance computational chemistry: An overview of NWChem, a distributed parallel application, Computer Physics Communications, 128(1–2): 260–283.

Kikuchi, H., Kalia, R. K., Nakano, A., Vashishta, P., Iyetomi, H., Ogata, S., Kouno, T., Shimojo, F., Tsuruta, K., and Saini, S. (2002). Collaborative simulation Grid: multiscale quantum-mechanical/classical atomistic simulations on distributed PC clusters in the US and Japan, In Proceedings of Supercomputing 2002, Los Alamitos: IEEE.

Kohn, W. (1996). Density functional and density matrix method scaling linearly with the number of atoms, Physical Review Letters, 76: 3168–3171.

Martyna, G. J., Tuckerman, M. E., Tobias, D. J., and Klein, M. L. (1996). Explicit reversible integrators for extended systems dynamics, Mol. Phys., 87: 1117–1157.

Moon, B., Jagadish, H. V., Faloutsos, C., and Saltz, J. H. (2001). Analysis of the clustering properties of the Hilbert space-filling curve, IEEE Transactions on Knowledge and Data Engineering, 13(1): 124–141.

Nakano, A. (1997). Parallel multilevel preconditioned conjugate-gradient approach to variable-charge molecular dynamics, Computer Physics Communications, 104: 59–69.

Nakano, A. (1999). Multiresolution load balancing in curved space: the wavelet representation, Concurrency: Practice and Experience, 11: 343–353.

Nakano, A. and Campbell, T. J. (1997). An adaptive curvilinear-coordinate approach to dynamic load balancing of parallel multiresolution molecular dynamics, Parallel Computing, 23: 1461–1478.

Nakano, A., Kalia, R. K., and Vashishta, P. (1999). Scalable molecular-dynamics, visualization, and data-management algorithms for materials simulations, Computing in Science & Engineering, 1(5): 39–47.

Nakano, A., Bachlechner, M. E., Kalia, R. K., Lidorikis, E., Vashishta, P., Voyiadjis, G. Z., Campbell, T. J., Ogata, S., and Shimojo, F. (2001). Multiscale simulation of nanosystems, Computing in Science & Engineering, 3(4): 56–66.

Nakano, A., Kalia, R. K., Vashishta, P., Campbell, T. J., Ogata, S., Shimojo, F., and Saini, S. (2002). Scalable atomistic simulation algorithms for materials research, Scientific Programming, 10: 263–270.

Nakano, A., Kalia, R. K., Nomura, K., Sharma, A., Vashishta, P., Shimojo, F., van Duin, A. C. T., Goddard, W. A., III, Biswas, R., and Srivastava, D. (2006). A divide-and-conquer/cellular-decomposition framework for million-to-billion atom simulations of chemical reactions, Computational Materials Science, in press.

Ogata, S., Lidorikis, E., Shimojo, F., Nakano, A., Vashishta, P., and Kalia, R. K. (2001). Hybrid finite-element/molecular-dynamics/electronic-density-functional approach to materials simulations on parallel computers, Computer Physics Communications, 138(2): 143–154.

Ogata, S., Campbell, T. J., Kalia, R. K., Nakano, A., Vashishta, P., and Vemparala, S. (2003). Scalable and portable implementation of the fast multipole method on parallel computers, Computer Physics Communications, 153(3): 445–461.

Ogata, S., Shimojo, F., Kalia, R. K., Nakano, A., and Vashishta, P. (2004). Environmental effects of H2O on fracture initiation in silicon: a hybrid electronic-density-functional/molecular-dynamics study, Journal of Applied Physics, 95(10): 5316–5323.

Omeltchenko, A., Campbell, T. J., Kalia, R. K., Liu, X. L., Nakano, A., and Vashishta, P. (2000). Scalable I/O of large-scale molecular dynamics simulations: A data-compression algorithm, Computer Physics Communications, 131(1–2): 78–85.

Rappe, A. K. and Goddard, W. A. (1991). Charge equilibration for molecular-dynamics simulations, Journal of Physical Chemistry, 95(8): 3358–3363.

Rino, J. P., Ebbsjo, I., Kalia, R. K., Nakano, A., and Vashishta, P. (1993). Structure of rings in vitreous SiO2, Physical Review B, 47(6): 3053–3062.

Schlick, T., Skeel, R. D., Brunger, A. T., Kale, L. V., Board, J. A., Hermans, J., and Schulten, K. (1999). Algorithmic challenges in computational molecular biophysics, Journal of Computational Physics, 151(1): 9–48.

Shan, H. Z., Singh, J. P., Oliker, L., and Biswas, R. (2002). A comparison of three programming models for adaptive applications on the Origin2000, Journal of Parallel and Distributed Computing, 62(2): 241–266.

Shan, H. Z., Singh, J. P., Oliker, L., and Biswas, R. (2003). Message passing and shared address space parallelism on an SMP cluster, Parallel Computing, 29(2): 167–186.

Sharma, A., Nakano, A., Kalia, R. K., Vashishta, P., Kodiyalam, S., Miller, P., Zhao, W., Liu, X. L., Campbell, T. J., and Haas, A. (2003). Immersive and interactive exploration of billion-atom systems, Presence: Teleoperators and Virtual Environments, 12(1): 85–95.


Shimojo, F., Kalia, R. K., Nakano, A., and Vashishta, P. (2005). Embedded divide-and-conquer algorithm on hierarchical real-space grids: parallel molecular dynamics simulation based on linear-scaling density functional theory, Computer Physics Communications, 167(3): 151–164.

Strachan, A., van Duin, A. C. T., Chakraborty, D., Dasgupta, S., and Goddard, W. A. (2003). Shock waves in high-energy materials: the initial chemical events in nitramine RDX, Physical Review Letters, 91(9): 098301.

Szlufarska, I., Nakano, A., and Vashishta, P. (2005). A crossover in the mechanical response of nanocrystalline ceramics, Science, 309(5736): 911–914.

Takemiya, H., Tanaka, Y., Sekiguchi, S., Ogata, S., Kalia, R. K., Nakano, A., and Vashishta, P. (2006). Sustainable adaptive Grid supercomputing: multiscale simulation of semiconductor processing across the Pacific, In Proceedings of Supercomputing 2006, Tampa: ACM.

Tanaka, Y., Nakada, H., Sekiguchi, S., Suzumura, T., and Matsuoka, S. (2003). Ninf-G: a reference implementation of RPC-based programming middleware for Grid computing, Journal of Grid Computing, 1: 41–51.

Truhlar, D. G. and McKoy, V. (2000). Computational chemistry, Computing in Science & Engineering, 2(6): 19–21.

Umezawa, N., Kalia, R. K., Nakano, A., Vashishta, P., and Shimojo, F. (2007). RDX (1,3,5-trinitro-1,3,5-triazine) decomposition and chemisorption on Al(111) surface: first-principles molecular dynamics study, Journal of Chemical Physics, in press.

van Duin, A. C. T., Dasgupta, S., Lorant, F., and Goddard, W. A. (2001). ReaxFF: a reactive force field for hydrocarbons, Journal of Physical Chemistry A, 105(41): 9396–9409.

Vashishta, P., Kalia, R. K., and Nakano, A. (2006). Multimillion atom simulations of dynamics of oxidation of an aluminum nanoparticle and nanoindentation on ceramics, Journal of Physical Chemistry B, 110(8): 3727–3733.

Whaley, R. C., Petitet, A., and Dongarra, J. J. (2001). Automated empirical optimizations of software and the ATLAS project, Parallel Computing, 27(1–2): 3–35.

Yang, W. (1991). Direct calculation of electron density in density-functional theory, Physical Review Letters, 66: 1438–1441.

Yang, W. and Lee, T.-S. (1995). A density-matrix divide-and-conquer approach for electronic structure calculations of large molecules, Journal of Chemical Physics, 103: 5674–5678.

Zhang, C., Bansal, B., Branicio, P. S., Kalia, R. K., Nakano, A., Sharma, A., and Vashishta, P. (2006). Collision-free spatial hash functions for structural analysis of billion-vertex chemical bond networks, Computer Physics Communications, 175: 339–347.
