pp06 abstracts · 42 pp06 abstracts tion framework coolfluid is a state-of-the art object oriented...

40 PP06 Abstracts

CP1

A Cartesian Structured AMR Framework for Par-allel Fluid-Structure Interaction Simulation

Time-accurate fluid-structure interaction simulations ofshock waves impinging on deforming solid structures ben-efit significantly from the application of dynamic meshadaptation in the fluid. We present an extension of thepatch-based parallel fluid solver framework AMROC forthis problem class. Special attention is given to a scal-able implementation of the hierarchical mesh refinementmethod, efficient parallel inter-solver communication rou-tines, and the effective transformation of the evolving solidboundary into a Cartesian level set function.

Ralf Deiterding, Sean P. MauchCenter for Advanced Computing ResearchCalifornia Institute of [email protected], [email protected]

CP1

Comparison of Parallel Decomposition Strategiesfor Turbulent Channel Flow Simulations

As we push the record of Reynolds number higher andhigher in turbulent channel flow simulations, finding ef-ficient parallel decomposition strategies becomes more andmore important. Most of large computations are now usingthe message-passing model such as MPI, where the com-munication performance is a critical factor in the overallcode performance. Conventional decomposition strategiesthat used to be adequate on small- to mid-size clusters canincur a disastrous communication cost at Tera-scale clus-ters, where typically hundreds or thousands processors areconnected through complex network topology. Typical tur-bulent channel flow simulations have two homogenous di-rections, where efficient FFTs can be used. However, usingdomain decomposition across those directions is discour-aged to minimize the communication cost. In this study,we will compare several decomposition strategies and pro-vide guidelines to achieve the best overall performance un-der various conditions.

Junwoo LimPittsburgh Supercomputing CenterCarnegie Mellon [email protected]

CP1

Parallelization of Fourier Spectral Element Code

In this paper, we implemented new parallel models forFourier Spectral Element code in order to use large numberof processors. New models can use more processors thanbefore by doing domain decomposition in two directions,while the old one only in one direction. This overcomesthe bottleneck of Fourier planes limit in parallel comput-ing. Detailed implementation and benchmark results willbe given. The new models can be run on thousands ofprocessors in practical simulation. Some results of largescale turbulent simulations using new parallel models willbe given.

Jin XuPhysics DivisionArgonne National Lab.jin [email protected]

CP2

Performance Evaluation of a 3D ElectrodynamicsSimulation on Massively Parallel Computers

We evaluate the performance for a 3D Maxwell FDTD codeon two novel massively parallel computers, the 12,288 pro-cessor QCDOC at Brookhaven National Laboratory anda 1024 node BlueGene/L at Argonne National Laboratoryand the Alpha cluster at Pittsburgh Supercomputing Cen-ter. QCDOC is the more distinctive of these architectures(e.g., higher-dimensional mesh) designed for QCD applica-tions. The FDTD application will be one of the first non-QCD applications evaluated from a scaling perspective onthis machine.

Robert E. BennettComputation Science CenterBrookhaven National [email protected]

Nicholas D’ImperioComputational Science CenterBrookhaven National [email protected]

CP2

Fast Electrostatic Force Calculation

Fast Multipole Method (FMM) and Particle Mesh Ewald(PME) are well known fast algorithms to evaluate longrange electrostatic interactions in Molecular Dynamics.However, those algorithms suffer from poor efficiency onlarge parallel machines especially for problem size under100,000 atoms. A variation of FMM based on Plane WaveExpansions is introduced and its parallel efficiency is com-pared with PME through detail time measurements.

Amirali KiaPhD StudentStanford [email protected]

Eric F. DarveStanford UniversityCenter for Turbulence [email protected]

Daejoong KimPhD StudentStanford [email protected]

CP2

Simulation of Laser Propagation in a Plasma witha Frequency Maxwell Equation

The aim is to perform numerical simulations of the propa-gation of a laser in a plasma. At each time step, one has tosolve a variable coefficient Helmholtz equation in a domainconsisting in one or two hundreds of millions of cells. Thealgorithm is based on a blend of a domain decompositionmethod with a fast variable coefficient Helmholtz solverand a QD diagonalization of an auxiliary matrix. We showsome results which are obtained on a parallel architectureat CEA-DAM (France).

Frederic NatafCMAP, Ecole PolytechniqueFrance

PP06 Abstracts 41

[email protected]

CP2

Parallel Performance of An Iterative Solver forHelmholtz Problems

We solve the linear system obtained from the discretiza-tion of the time-harmonic wave equation in a heteroge-neous medium with a Bi-CGSTAB iterative method, usinga complex shifted-Laplace operator as preconditioner. Amultigrid method is used to approximately invert the pre-conditioning operator. We investigate the parallel perfor-mance of a matrix-free algorithm for this solver and showresults for large scale problems in seismic wave propagationin the Earth in the time and frequency domain.

Kees VuikDepartment of Applied Mathematical [email protected]

Christina D. RiyantiDelft Institute of Applied MathematicsDelft University of [email protected]

Alex KononovComputational Physics Group, PCMTDelft University of [email protected]

Kess OosterleeDelft Institute of Applied MathematicsDelft University of [email protected]

CP3

Parallelisation and Scalability Issues in a MultilevelEHL Solver

The computation of numerical solutions to elastohydrody-namic lubrication problems is only possible on fine meshesby using a combination of multigrid and multilevel tech-niques. In this paper we show how the parallelisation ofboth multigrid and multilevel multi-integration for theseproblems may be accomplished and discuss the scalabilityof the resulting code. A performance model of the solver isconstructed and used to perform an anlysis of the resultsobtained. Results are shown with good speed-ups and ex-cellent scalability for both shared and distributed memoryarchitectures and in agreement with the model.

Martin BerzinsSCI InstituteUniversity of [email protected]

Christopher E. GoodyerUniversity of [email protected]

CP3

Application of Machine Learning to SelectingSolvers for Sparse Linear Systems

We demonstrate how machine learning techniques, likeboosting, can be used to select iterative linear solvers thatmatch given application characteristics. We discuss the re-search issues involved and our current implementation of

this automatic selection process. These include,creation,storage, and retrieval of a representative database; map-ping the selection problem into a binary classification prob-lem commensurate with machine learning algorithms; andsoftware to bridge between different codes and provide auser-friendly interface.

Sanjukta BhowmickDepartment of Applied Physics and Applied MathematicsColumbia [email protected]

Erika FuentesUniversity of [email protected]

Victor EijkhoutUniversity of Texas at AustinTexas Advanced Computer [email protected]

Yoav FreundDepartment of Computer Science and Engineering,University of California, San [email protected]

David E. KeyesColumbia UniversityDepartment of Applied Physics & Applied [email protected]

CP3

Performance Studies of a Parallel Global Search onSystem X

We evaluate a parallel implementation for a global searchalgorithm DIRECT on System X, a 2200 processor cluster.The parallel scheme addresses design challenges in memoryrequirement and data dependency. The performance stud-ies focus on data structure efficiency, program scalability,resource utilization, and load balancing. We analyticallyexplore the strength and weaknesses of DIRECT in thissetting, identify several key sources of inefficiency, and ex-perimentally evaluate a number of effective solutions in theparallel implementation.

Layne T. WatsonVirginia Polytechnic Institute and State UniversityDepartments of Computer Science and [email protected]

Jian HeVirginia Polytechnic Institute and State UniversityDepartment of Computer [email protected]

Alex VerstakDepartment of Computer Science,Virginia Polytechnic Institute and State [email protected]

Tom PanningVirginia Polytechnic Institute and State [email protected]

CP3

Parallelising An Existing Object-Oriented Simula-

42 PP06 Abstracts

tion Framework

COOLFluiD is a state-of-the art object oriented frame-work, written in C++. It uses advanced software tech-niques to combine modularity, flexibility and speed, andwas recently parallelized for distributed memory machines.COOLFluiD now provides transparent parallelisation forboth end-users and code developers, while still allowingcompilation without parallelisation. Input and output isfully abstracted and uses parallel IO whenever possible.This paper discusses the methods and algorithms used, andelaborates on our experiences with the framework for solv-ing engineering problems on high performance platforms.

Tiago Quintinovon Karman Institute for Fluid [email protected]

Andrea LaniAerospace Depart.von Karman Institute for Fluid [email protected]

Stefan VandewalleScientific Computing Research [email protected]

Stefaan PoedtsCentre for Plasma [email protected]

Dries KimpeScientific Computing Research [email protected]

CP4

Architecutre and Desing of Gridsat: A DistributedLarge Scale Satisfiability Solver

GridSAT is a sophisticated and complex grid applicationwhich can simultaneously run on a large set of geographi-cally distributed computational resources including severalNational Science Foundation computing centers and desk-top machines and smaller scale, locally managed clusters.GridSAT was able to solve especially hard and previouslyunsolved problems. The GridSAT solving power is cur-rently accessible to users through a simple portal inter-face at http://orca.cs.ucsb.edu/sat portal that automati-cally employs a variety of nationally distributed resourceson behalf of each user.

Wahid ChrabakhUniversity of California, Santa [email protected]

CP4

A Generative Programming Approach for Mas-ter/Slave Scientific Applications

The purpose of this work is to simplify the developmentof parallel scientific applications that follow a master/slavepattern and use the MPI library. Generative Programmingis used to provide high-level constructs which enable pro-grammers to focus on writing the specific code for the mas-ter and slaves while the generated communication detailsare hidden. We provide test cases to evaluate the program-

mer effort required, the scalability, and the performance ofapplications that use our approach.

Purushotham BangaloreUniv. of Alabama at BirminghamDept. of Computer and Information [email protected]

Francisco A. Hernandez, Kevin ReillyUniversity of Alabama at [email protected], [email protected]

CP4

Task-Based Software Development for ScientificComputing

We extend the model of task-parallel executions so that thesame coordination program can be executed on differentheterogeneous platforms. The extended model is partic-ularly suited for large applications consisting of indepen-dent modules which can be mapped onto different parts ofdistributed platforms. We show that a suitable represen-tation of the execution activities is crucial for combining aflexible multi-level specification with a dynamic schedulingthat can be adapted to a dynamically changing executionenvironment.

Thomas RauberUniversitat Bayreuth, [email protected]

Gudula RungerTU Chemnitz, [email protected]

CP4

Integrated Performance Monitoring of Highly Par-allel HPC Workloads

Integrated Performance Monitoring (IPM) is an infrastruc-ture for profiling HPC workloads. The software is de-signed to be sufficiently lightweight and easy to use thatprofiles can be taken in the course of normal productionruns. Analysis of application profiles making up a work-load yields valuable insight into parallel applications andarchitectures. IPM is open source, relies on portable soft-ware technologies and is scalable to tens of thousands oftasks.

David E. SkinnerNERSC/[email protected]

CP4

Memory Adaptation of Scientific Applications ViaAn Application-Level, Remote Memory Library

MMlib is an application-level, runtime library that controlsthe DRAM allocations of specified objects, enabling scien-tific applications to execute optimally under variable mem-ory availability. We extend MMlib so that managed objectsare not stored on disk, but on several remote servers, andare fetched transparently when needed through MPI. Ad-vances in network speeds, limited local disk on MPPs, andlarge granularity suggest the benefits and often the needfor our approach. To our knowledge, our remote memory

PP06 Abstracts 43

approach is the first entirely at the application-level.

Dimitrios NikolopoulosDept. of Computer ScienceCollege of William and [email protected]

Andreas StathopoulosCollege of William & MaryDepartment of Computer [email protected]

Chuan YueDepartment of Computer ScienceCollege of William and [email protected]

CP5

Evaluating the Implementation of An IterativeSolver in An Ocean Model

We present a tool aimed at improving the design and im-plementation of memory efficient sparse linear algebra al-gorithms through automated memory analysis. Our auto-mated memory analysis uses a language processor to pre-dict the data movement required for an iterative algorithmbased upon a Matlab implementation of that algorithm.We demonstrate how automated memory analysis is ap-plied to evaluate implementation choices in the conjugategradient solver in the Parallel Ocean Program (POP).

John M. DennisScientific Computing DivisionNational Center for Atmospheric [email protected]

Elizabeth JessupUniversity of Colorado at BoulderDepartment of Computer [email protected]

CP5

Testing Unsaturated Flow Finite Element and Fi-nite Volume Groundwater Programs on High Per-formance Computers Using Analytical Solutions

Three-dimensional steady-state and transient analytical so-lutions of Richards equation for unsaturated groundwaterflow have recently been derived, and these solutions arebeing used to test performance and accuracy of a paral-lel unstructured mesh finite element program and a struc-tured grid finite volume program. Variables in this studyinclude grid/mesh size, time step size, Picard versus New-ton iterations, and the number of processors. The analyti-cal solutions and the performance/accuracy results will bepresented.

Fred T. TracyEngineer Research and Development CenterWaterways Experiment [email protected]

CP5

Homme: A High-Performance Scalable Atmo-spheric Modeling Framework

The High Order Method Modeling Environment is a frame-work that provides the tools necessary to create a high-

performance scalable Atmospheric Model. It currentlysupports spectral element and discontinuous Galerkinschemes. We provide scaling results for a spectral elementbased 3D moist primitive equations simulation on a 10 kmhorizontal resoultion on 32K processors and a discontin-uous Galerkin based shallow-water equations problem ona 40 km horizontal resolution on 2K processors of a BlueGene/L system.

Ramachandran NairScientific Computing DivisionNational Center for Atmospheric [email protected]

Amik St-CyrNational Center for Atmospheric [email protected]

John M. Dennis, Jim EdwardsScientific Computing DivisionNational Center for Atmospheric [email protected], [email protected]

Michael LevyDepartment of Applied MathematicsUniversity of Colorado at [email protected]

Richard LoftScientific Computing DivisionNational Center for Atmospheric [email protected]

Henry TufoNCARUniversity of Colorado at [email protected]

Theron VoranDepartment of Computer ScienceUniversity of Colorado at [email protected]

Steve ThomasNational Center for Atmospheric [email protected]

CP6

A Multi-Method Ode Software Component for Par-allel Simulation of Diesel Engine Combustion

Modeling of Diesel engines requires the solution of a largenumber of large systems of ODEs arising from detailedmodels of combustion. In these models, the concentra-tions of the chemical species change with very different timescales, introducing high degrees of stiffness in the criticalphase where combustion starts. In this work we propose aparallel combustion solver based on a multi-method ODEsoftware component for effective solution of combustion insimulations of Diesel engines.

Pasqua D’AmbraInstitute for High-Performance Computing andNetworking(ICAR-CNR)[email protected]

Paola Belardini, Claudio Bertoli

44 PP06 Abstracts

Engine Institute (IM-CNR)[email protected], [email protected]

Stefania CorsaroDept. of Statistics and Mathematics for EconomicResearchUniversity of Naples ”Parthenope”[email protected]

CP6

Multiple Projection Algorithms on the Peer toPeer Lake/yml Environment

The LAKe/YML environnement is constituted by the in-tegration of the LAKe component-oriented numerical li-brary to the peer to peer YML system. YML is a frame-work allowing workflow graph description adaptable toseveral middeware. We present some multiple projectionmethods to compute a few eigen-elements of a very largenon-Hermitian matrix on LAKe/YML. The open problemsposed by these methods from the numerical and/or envi-ronmental viewpoint will be pointed out.

Olivier J. DelannoyPRISM [email protected]

Nahid EmadUniversity of Versailles, PRiSM [email protected]

CP6

Parallelization of the Diffusion Equation Methodfor Global Optimization

We use a discrete version of the Diffusion Equation Method(DEM) for global optimization to solve a problem frommagnetotelluric geoprospecting. The main computationaleffort for this application is in repeatedly evaluating the ob-jective function, which fortunately can be done in a highlyparallel manner. We discuss our multilayer parallel imple-mentation of the objective function and discrete DEM, andanalyze the performance and scalability of our approach.

Rebecca J. Hartman-BakerOak Ridge National [email protected]

Michael HeathUniversity of Illinois at [email protected]

CP6

Parallel Fem Modeling of Semiconductor Devices

This talk presents preliminary results for an unstructuredmesh stabilized finite element (FE) formulation of the driftdiffusion equations to model semiconductor devices. Abrief motivation, overview of the formulation and prelimi-nary results are presented. Performance and scaling resultsfor domain decomposition and multilevel solution meth-ods on large-scale simulations will be presented. Uniquechallenges in modeling this class of problems will be dis-cussed, including the presence of very strong source termsand gradients. Sandia is a multiprogram laboratory oper-ated by Sandia Corporation, a Lockheed Martin Company,for the United States Department of Energy’s National Nu-clear Security Administration under contract DE-AC04-

94AL85000.

Paul LinSandia National [email protected]

Gary Hennigan, Rob Hoekstra, Joe Castro, Debbie FixelSandia National [email protected], [email protected],[email protected], [email protected]

John ShadidSandia National LaboratoriesAlbuquerque, [email protected]

Eric PhippsSandia National [email protected]

Roger PawlowskiSandia National [email protected]

CP7

Algorithm-Based Checkpoint-Free Fault Tolerancefor Parallel Matrix Computations on Volatile Re-sources

This talk presents an algorithm-based checkpoint-free faulttolerance approach in which, instead of taking checkpointsperiodically, a coded global consistent state of the criticalapplication data is maintained in memory by modifyingapplications to operate on encoded data. Although theapplicability of this approach is not so general as the typ-ical checkpoint/rollback-recovery approach, in parallel lin-ear algebra computations where it usually works, becauseno periodical checkpoint or rollback-recovery is involved inthis approach, partial process failures can often be toler-ated with a surprisingly low overhead.

Jack J. DongarraDepartment of Computer ScienceThe University of [email protected]

Julien LangouThe University of [email protected]

Zizhong ChenUniversity of Tennessee, [email protected]

CP7

Scalable Parallel Infrastructure for UnstructuredData

A new interface and implementation for unstructured par-allel data is based upon a concise, geometrically-inspiredinterface. Abstract covering relationships encode decom-positions of computational spaces into components, usingdistributed graph data structures termed Sieves. Arbi-trary data over Sieves is elegantly represented using a dualfiber-bundle-like construction. Sieve can efficiently sup-port many hierarchical algorithms (multigrid, FMM), inde-pendently of dimension, element shape, and finite element.A reference C++ implementation and several applications

PP06 Abstracts 45

are demonstrated.

Matthew G. Knepley, Dmitry KarpeevArgonne National [email protected], [email protected]

CP7

Directive-Driven Parallelization of 4GL Codes

Fourth Generation Languages, like RSI/IDL or Matlab,are very popular tools for image processing and scien-tific data analysis. Many algorithms written in these lan-guages could benefit from distributed memory parallel pro-cessing. However, writing explicit message passing codein these languages contradicts their ease-of-use and ad-hoc coding philosophy. We here present a tool to sim-plify message-passing parallelization of IDL scripts usingdirective-driven pre-processing. Supported by NASA SBIRGrant #NNG05CA85C

Peter Messmer, Seth VeitzerTech-X [email protected], [email protected]

CP7

Distributed Out-of-Core Parallel Linear Algebra onGrid5000 Heterogeneous Platform

Several experimentations already shown that data opti-mizations, such as data persistence and data migrationanticipations, are crucial to develop efficient parallel largescale linear algebra methods on heterogeneous computingGRID. Nevertheless, we also have to use out-of-core pro-gramming on each computing resource to obtain adaptedgranularities and, thus, interesting efficiencies. We presentresults obtained on the GRID5000 heterogeneous comput-ing platform for a few linear algebra methods, such as BlockGauss-Jordan and Givens-Bisection ones. As a conclusion,we propose a Disk-to-Disk (D2D) programming paradigmfor parallel linear algebra on Heterogeneous GRID.

Serge G. PetitonINRIA and LIFL, Universite de [email protected]

Lamine AouadINRIA and [email protected]

CP8

Improving the Performance of Scientific Applica-tions Involving a Large Number of Time Steps

Irregularities in scientific applications involving a largenumber of time steps, and those in the underlying com-puting systems may evolve unpredictably, requiring run-time selection of load balancing algorithms to improve ap-plication performance. Fortunately, the large number oftime steps typical in such applications provide ample op-portunities for a reinforcement learning (RL) agent to learnhow to automatically select an appropriate load balancingalgorithm during runtime. This contribution describes aframework integrating RL and dynamic load balancing inscientific applications to improve performance.

Ioana BanicescuDept. of Computer Science and EngineeringMississippi State [email protected]

Ricolindo CarinoERC Center for Computational SciencesMississippi State [email protected]

CP8

Preconditioners for Transient Problems on a Space-Time Domain

We present the results of using various preconditioners ina Newton-Krylov algorithm for solving transient problemsas boundary value problems (as opposed to initial valueproblems). In this formulation, the entire solution overthe space-time domain is solved simultaneously, with par-allelism over both the time and space domains. Precon-ditioners designed to take advantage of the structure ofJacobian matrix used in solving the resulting system willbe compared to general-purpose preconditioners.

Daniel M. DunlavySandia National LaboratoriesOptimization and Uncertainty [email protected]

Andrew SalingerComputational Sciences Department, Sandia [email protected]

CP8

Parallel Finite-Difference Time-Domain Computa-tions Aided by Modal Decomposition

High frequency electromagnetic simulation of electrically-small objects requires computationally-expensive methodssuch as FDTD. Parallel processing is one means to reducesimulation time. A shortcoming of the parallel approach isthe communication overhead that is responsible for reducedspeedup efficiency but inevitable at every time step. Weapproach FDTD from the signal processing point of viewand consider concurrency with transient and steady-stateexcitation scenarios. We rely on modal decomposition andconvolution as the techniques to aid in our research. We de-velop communication-intensive as well as communication-free implementations for each.

Dmitry Gorodetsky, Philip WilseyUniversity of [email protected], [email protected]

CP8

Application of Dimensionality Reduction Tech-niques to Time-Parallelization

We recently developed a new data-driven parallelizationstrategy that decomposes the time domain, instead of theconventional space domain. It depends on dimensionalityreduction for identifying important modes of system be-havior. We will report on the application of using thesetechniques in important molecular dynamics simulationsin materials and biology.

Lei JiDepartment of Computer ScienceFlorida State [email protected]

Yanan Yu, Ashok Srinivasan

46 PP06 Abstracts

Department of Computer Science,Florida State [email protected], [email protected]

CP9

MCell-K: A Highly Scalable Cell MicrophysiologySimulator on Blue Gene L

MCell-K is a scalable variant of MCell, a highly successfulMonte Carlo simulator of cellular microphysiology, whichuses a random walk to model reaction/diffusion systems.We present new scalability results on hundreds of proces-sors of Blue Gene and IBM Power4 systems, demonstratingexceptional scalability on a realistic simulation of neuro-transmission. We discuss performance issues, including runtime support and algorithmic changes needed to achievehigh scalability.

Gregory T. BallsCSE DepartmentUniv of California, San [email protected]

Terrence SejnowskiSalk Inst. for Biological StudiesCSE Dept., Univ of California, San [email protected]

Scott B. BadenDept of Computer Science and EngineeringUniversity of California, San [email protected]

Thomas M. BartolThe Salk [email protected]

CP9

Runtime System for Asynchronous Cellular Micro-physiology Simulation

Simulations of dynamic environments are hard to paral-lelize efficiently, because the need to preserve causality ex-acerbates the impact of load imbalance. We discuss Tar-ragon, a runtime system that offers asynchronous data-driven execution, and supports relaxed synchronizationand dynamic load balancing. Tarragon’s API hides lowerlevel system dependent details, taking advantage of anyarchitectural features that can support its data driven ex-ecution model. We discuss recent results with a cell micro-physiology simulator.

Pietro CicottiUniversity of California, San DiegoDepartment. Computer Science and [email protected]

Scott B. BadenUniversity of California, San DiegoDept of Computer Science and [email protected]

CP9

Large Scale Parallel Implementation of Replica Ex-change on Blue Gene/L

Replica exchange is an enhanced sampling algorithm usedin thermodynamic simulations of proteins and other sys-

tems. We will present an efficient and scalable implementa-tion of replica exchange that is tuned to exploit the physicalnetwork topology and hardware features of BlueGene/L.Performance measurements of the Blue Matter MolecularDynamics application using the replica exchange methodon thousands of nodes of Blue Gene/L will be presented.

Blake Fitch, Maria EleftheriouIBM Thomas J. Watson Research [email protected], [email protected]

Jed M. PiteraScience and Technology, IBM Almaden Research Center,San [email protected]

Aleksandr Rayshubski, Robert S. Germain, Ruhong ZhouIBM Thomas J. Watson Research [email protected], [email protected],[email protected]

CP9

Strong Scaling of Biomolecular Simulation on BlueGene/L

To support simulations of proteins and membranes overlong time scales, we have developed a molecular dynam-ics code, Blue Matter, that also serves as a test bed forexploitation of Blue Gene/L hardware features and for ex-plorations of parallel decompositions and algorithms suit-able for massively parallel machines. We will describe theparallel decomposition and distributed 3D-FFT that haveenabled us to achieve continued speedup through 16,384Blue Gene/L nodes on a 43,222 atom system.

Blake Fitch, Aleksandr Rayshubskiy, Maria Eleftheriou,Michael PitmanIBM Thomas J. Watson Research [email protected], [email protected],[email protected], [email protected]

Chris WardIBM Software [email protected]

Mark Giampapa, Frank Suits, Robert S. GermainIBM Thomas J. Watson Research [email protected], [email protected],[email protected]

CP10

High Order Solutions to Self-Adjoint Elliptic Equa-tions by Reduction to Constant Coefficients

We develop a solver for variable coefficient, self-adjoint,elliptic equations. The solution of an auxiliary constantcoefficient equation induces a reduction to constant coef-ficient equations plus small inhomogeneous deviations. Ahighly accurate, fast, Fourier-spectral algorithm can solvessuch equations. A small number of steps achieves high ac-curacy. The procedure becomes more efficient as the size ofthe domains of solution decreases. A highly parallelizable,hierarchical, decomposition and reconstruction proceduregives an accurate global solution.

Moshe IsraeliTechnion Institute of Technology, [email protected]

PP06 Abstracts 47

CP10

Parallel Domain Decomposition Methods for SomeStochastic

We present a parallel multilevel domain decompositionpreconditioned recycling Krylov subspace method for thenumerical solution of some elliptic partial differentialequations with stochastic uncertainties in the operator.Karhunen-Loeve expansion and finite element method withdouble orthogonal polynomial basis are used to transformthe stochastic problem into a sequence of deterministicequations. A PETSc based parallel software is developedfor a recycling domain decomposition preconditioned iter-ative method and its parallel scalability will be reported inthis talk.

Chao Jin, Congming LiDept. of Applied MathematicsUniversity of Colorado, [email protected], [email protected]

Xiao-Chuan CaiUniversity of Colorado, BoulderDept. of Computer [email protected]

CP10

New Scalability and Accuracy Results for aLatency-Tolerant Elliptic Solver

We present new scalability and accuracy results for asecond-order accurate elliptic free space solver. Employinga method of local corrections, we reduce communicationcosts by representing far-field effects at a coarser resolu-tion. Improvements in parallel scalability for the globalcoarse calculation and the introduction of adaptivity allowus to scale efficiently to thousands of processors. Numer-ical overheads incurred are independent of the number ofprocessors for a wide range of problem sizes.

Peter MccorquodaleLawrence Berkeley National [email protected]

Scott B. BadenDept of Computer Science and EngineeringUniversity of California, San [email protected]

Phillip ColellaLawrence Berkeley National [email protected]

Gregory T. BallsCSE DepartmentUniv of California, San [email protected]

CP10

The Addition of Hp-Adaptivity to a Parallel Adap-tive Finite Element Program

The hp version of the finite element method is an adaptivefinite element approach in which adaptivity occurs in boththe size, h, of the elements and in the order, p, of theapproximating piecewise polynomials. The objective is todetermine a distribution of h and p that minimizes the errorusing the least amount of work in some measure. In this

talk we discuss the addition of hp-adaptivity to PHAML,a parallel adaptive multigrid finite element program.

William F. MitchellNational Institute of Standards and [email protected]

CP11

Multiresolution Analysis and Implicit DeferredCorrection on Parallel Computers

We describe the application of the deferred correctionprocedure to a multiwavelet based multiresolution analy-sis framework for solving integro-differential equations forquantum electronics structures and time dependent den-sity functional theory computations. The deferred correc-tion method is implemented in the multiresolution solverpackage Multiresolution Adaptive Evaluation for ScientificSimulations (MADNESS). We also report on preliminaryparallel performance on the Cray-XT3, the Cray XD-1 andthe IBM Blue Gene/L machines.

George FannOak Ridge National [email protected]

CP11

Scalability Study of Accelerator Simulations onLarge-Scale Clusters

In this paper, we present a detailed scalability analysis ofadvanced accelerator simulations on various TOP500 sys-tems at NCSA. In particular, we study the performanceof Synergia, a state-of-the-art accelerator modeling codefunded by the DOE SciDAC program. The goal is to under-stand the performance and scaling behavior of acceleratorsimulations on different CPU architectures and messagelayers and further to investigate the performance bottle-necks associated with the existing accelerator simulations.

Zhiling Lan, Jungmin LeeIllinois Institute of [email protected], [email protected]

Panagiotis SpentzourisFermi National [email protected]

James AmundsonFermi National Accelerator [email protected]

CP11

A Parallel Software Framework for Solving Radia-tive Heat Transfer Problems by the Photon MonteCarlo Method

Radiative heat transfer is extremely difficult to model be-cause of its highly nonlocal and nonlinear nature. In thistalk we present a parallel software framework that imple-ments the photon Monte Carlo method of ray tracing tosimulate radiative effects. We demonstrate the scalabilityof this framework and address load balancing issues in-herent with this approach. The framework allows for theincorporation of user-specified radiative properties whichenables its use within a variety of combustion applications.

Paul E. Plassmann

48 PP06 Abstracts

Virginia [email protected]

Ivana VeljkovicPenn State UniversityDepartment of Computer Science and [email protected]

CP12

Decomposing Solution Sets of Polynomial Systemsin Parallel

Consider the numerical irreducible decomposition of a pos-itive dimensional solution set of a polynomial system intoirreducible factors. With path tracking techniques we cantrack loops around singularities and may connect points onthe same irreducible components. The computation of alinear trace for each factor certifies the decomposition. Us-ing the concepts of monodromy and linear trace, we presenta new parallel monodromy breakup algorithm. Our imple-mentation is written in C using MPI and is linked to thelibrary of PHCpack, a software package for polynomial ho-motopy continuation. It has exhibited good speedups andthe fastest decomposition times on several classical bench-marks, especially on solution sets of large degree.

Anton LeykinUniversity of Illinois at [email protected]

Jan VerscheldeDepartment of Mathematics, Statistics and ComputerScienceUniversity of Illinois at [email protected]

CP12

Parallel Space-Filling Curve Generation for Dy-namic Load Balancing

Adaptive mesh refined codes require a dynamic load bal-ancer. Load balancers based on space-filling curves cangenerate reasonable partitions but it has not been clearhow to generate the space-filling curves in parallel. Wehave developed a parallel merge sort method to generatethese curves that scales well up to thousands of processors.Complexity analysis and experimental results are given.

Martin BerzinsSCI InstituteUniversity of [email protected]

Justin P. Luitjens, Tom HendersonUniversity of [email protected], [email protected]

CP12

An Improved Bi-Level Algorithm for Load-Balancing Structured Refinement Hierarchies

Simulations on evolving structured hierarchical meshestypically suffer from poor parallel efficiency due to loadimbalances. We analyse this problem with a set of testconfigurations derived from AMROC. In particular, we dis-cuss a singularly converging Richtmyer-Meshkov instabilityfor which the native load balancer partially exhibits imbal-ances greater than 100 percent. This presentation proposes

an improved bi-level partioning algorithm that automat-ically switches between two approaches and reduces theimbalances to less than 30 percent.

Johan SteenslandSandia National [email protected]

Ralf DeiterdingCenter for Advanced Computing ResearchCalifornia Institute of [email protected]

Jaideep RaySandia National [email protected]

CP12

A Scalable Approach for Adaptive EmbeddedBoundary Cartesian Grid Generation

This talk describes an adaptive patch-based AMR schemefor complex geometry embedded boundaries where boththe flow solution and the grid generation are done com-pletely in parallel. The grid generation steps are dis-tributed across the processors in a load-balanced way.The work integrates the SAMR package SAMRAI and theCartesian adaptive grid generator in Cart3D. We demon-strate scaling performance using a model of aerosol disper-sion in urban environments. We also discuss methods fordetermining grid generation costs and how this was usedfor optimizing load balance.

Andrew M. WissinkCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

Marsha BergerCourant Institute of Mathematical SciencesNew York [email protected]

CP13

Distributed Memory Parallel Implementation ofthe Block Shift and Invert Lanczos Algorithm

The Block Shift and Invert Lanczos Algorithm is theworkhorse of commercial finite element analysis packagesfor performing vibration and buckling analyses. Eigenanal-ysis problems are reaching 50 million rows in these appli-cations. This paper will provide an overview of what webelieve to be the first distributed memory parallel imple-mentation of this algorithm for use in a commercial finiteelement package LS-DYNA.

Roger GrimesLivermore Software Technology [email protected]

Bob LucasISIUniv. of Southern [email protected]

CP13

Block Locally Optimal Preconditioned Eigenvalue

PP06 Abstracts 49

Xolvers

Block Locally Optimal Preconditioned Eigenvalue Xolvers(BLOPEX) is a package, written in C, that at presentincludes only one eigenxolver, Locally Optimal BlockPreconditioned Conjugate Gradient Method (LOBPCG).BLOPEX supports parallel computations through an ab-stract layer. BLOPEX is incorporated in the HYPREpackage from LLNL and is availabe as an external blockto the PETSc package from ANL as well as a stand-aloneserial library. Main LOBPCG features: a matrix-free it-erative method for computing several extreme eigenpairsof symmetric positive generalized eigenproblems; a user-defined preconditioner; robustness with respect to randominitial approximations, variable preconditioners, and ill-conditioning of the stiffness matrix; apparently optimalconvergence speed. Numerical comparisons suggest thatLOBPCG may be a genuine block analog for eigenproblemsof the standard preconditioned conjugate gradient methodfor symmetric linear systems.

Merico ArgentatiUniversity of Colorado at [email protected]

Andrew Knyazev, Ilya [email protected],[email protected]

Evgueni OvtchinnikovU. Westminster, [email protected]

CP13

An Efficient Cluster-Based Parallel FFT Algorithm

The paper presents a parallel radix-2 Cooley-Tukey FFTalgorithm on a cluster. A new data distribution method isinvented and parallelized to distribute data points amongprocessors such that no message passing is needed by thebit-reversal step in which each processor only performs thebit-reversal on its local vector with a fast linear-time re-cursive bit-reversal algorithm. The experimental resultshows that the parallel FFT algorithm achieves significantspeedup over the serial Stockham FFT algorithm.

Jingyu ZhangUniversity of Illinois at SpringfieldDepartment of Computer [email protected]

Michael StultsLegal Files Software Inc.Springfield, IL [email protected]

CP14

A Parallel Priority Queueing System with Balkingand Reneging

We pose and analyze a novel model of a priority queueingsystem with balking and reneging that is designed to mini-mize the idleness of the system. An algorithmic solution forthe general two-station case and closed-form solutions forspecial cases of this model are given. We find distributionof the queue length; probabilities of the system idleness,lost to the system, and stations being busy; and the first

two moments. Numerical examples are also given.

Aliakbar M. Haghighi, Dimitar MishevPrairie View A&M [email protected], [email protected]

CP14

Feedback-Based Guided Self-Scheduling

Parallel loops account for the greatest amount of paral-lelism in numerical and scientific applications. Schedulingparallel loops, i.e., the way iterations are mapped on todifferent processors, plays a critical role in the efficient ex-ecution of such applications, on multiprocessor systems. Inapplications where problem dimension (and hence execu-tion time) is dependent on run-time data, loop iterationsalso tend to be of variable length – this variability affectsboth sequential and parallel loops and in particular nestedloops and it is quite prevalent in sparse matrix solvers.In this work, we propose a feedback-directed approach fordynamic scheduling of irregular parallel loops on hetero-geneous multiprocessor systems. Results show that ourtechnique performs (upto) 18% better than the existingdynamic scheduling schemes.

Arun Kejariwal, Alexandru NicolauUniversity of California at Irvinearun [email protected], [email protected]

Constantine PolychronopoulosUniversity of Illinois at [email protected]

CP14

A Data Driven Model for Tolerating Communica-tion Delays

Hiding communication latency is a difficult challenge due tothe high software overheads incurred. We discuss Thyme, adata-driven programming model and run time library thatmanages communication pipelining and scheduling throughtask graph, actor-like execution. Thyme frees the userfrom many low-level system dependent details involved inmasking communication, easing the reformulation of ap-plications to tolerate delays. We demonstrate results withsome practical applications and discuss compiler supportusing the ROSE source-to-source translator.

Scott B. BadenUniversity of California, San DiegoDept of Computer Science and [email protected]

Jacob SorensenUniversity of California, San DiegoComputer Science and Engineering [email protected]

MS1

Cache-Oblivious Algorithms

Cache-oblivious algorithms and data structures areplatform-independent. They run efficiently on a hierarchi-cal memory, even though they avoid any memory-specificparameterization, such as cache-block sizes or access times.We give an overview of the work on designing cache-oblivious algorithms and data structures. We then givetheoretical and experimental results on searching and al-

50 PP06 Abstracts

gorithmic problems. Finally, we outline future directionsin the area.

Michael BenderStony Brook [email protected]

MS1

Adaptive Programming with Hierarchical Multi-processor Tasks

The talk considers the programming with hierarchicallystructured multiprocessor tasks (M-tasks). An M-task canbe mapped onto a group of processors and can be executedconcurrently to other independent tasks. Internally, an M-task can consist of a hierarchical composition of smaller M-tasks or can incorporate different kinds of parallelism. De-pending on the structure of the application, significant per-formance improvements can be obtained. As examples weconsider specific ODE solvers and hierarchical algorithmsfor matrix multiplication.

Thomas RauberUniversitat Bayreuth, [email protected]

Gudula RungerTU Chemnitz, [email protected]

MS1

Adaptive Algorithms: Theory and Applications

The minisymposium is firstly introduced by a taxonomy ofadaptive methods in scientific computing. Then, the talkfocuses on adaptive parallel computing on non-uniform andshared resources. Minimizing the completion time dependson two antagonists complexity measures: number of oper-ations and critical time. We present a dynamic cascadingtechnique based on the coupling of two algorithms, sequen-tial and parallel fine-grain. It is illustrated by practicalapplications in combinatorial optimization and linear alge-bra.

Van Dat CungGILCO Lab., INPG Grenoble, [email protected]

Guillaume HuardID Laboratory (Grenoble, France)supported by UJF / CNRS / INRIA / [email protected]

Jean-Louis RochMOAIS Project (CNRS-INRIA)ID-IMAG Lab., Grenoble, [email protected]

Denis TrystramMOAIS ProjectID-IMAG lab. INPG [email protected]

Jean-Guillaume DumasMNC-IMAG, UJF [email protected]

Thierry Gautier

MOAIS Project, ID-IMAG, [email protected]

Bruno RaffinMOAIS Project, ID-IMAG LabINRIA Rhone-Alpes, [email protected]

Christophe RapineLab GILCO, INP Grenoble, [email protected]

MS1

Hybrids in Exact Linear Algebra

For a remarkable range of types and sizes of integer andrational matrix problems, there is now sophisticated highperformance software. Among those problems are linearsystem solution, nullspace basis, rank, determinant, andcharacteristic polynomial. To reach high performance,engineered hybrids of multiple algorithms (close-to-the-hardware BLAS, iterative and exact methods) play an in-creased role. We will discuss hybrids in more detail, illus-trating several ways to gain and revealing some dilemmastoo.

Dumas Jean-GuillaumeLMC-IMAG Grenoble [email protected]

David SaundersDept of Computer and Information SciencesUniversity of Delaware, [email protected]

MS2

Identification of Severe Multiple Contingencies inElectric Power Networks

The August 14, 2003 blackout in the Northeast has empha-sized on the need to achieve improved operational efficiencyand reliability standards for electric power systems. Suchattempts are however hindered by the lack of advancedcomputational tools to better perform security evaluations.Present practices and policies are limited by the inabilityto evaluate multiple simultaneous outages, and as observedby 2003 Northeast blackout, the impact of triple contingen-cies can be enormous. This talk will focus on a two-stagescreening and analysis process for identifying multiple con-tingencies that may result in very severe disturbances andblackouts. In a screening stage, an optimization problem isformulated to find the minimum change in the network tomove the power flow feasibility boundary to the present op-erating point, which will cause the system to separate witha user-specified power imbalance. The lines identified bythe optimization program are used in a subsequent analy-sis stage to find combinations that may lead to a blackout.Application of this approach to the IEEE 30-bus systemand 179-bus system that yields encouraging results will bepresented.

Bernard [email protected]

Vanessa LopezComputational Research DivisionLawrence Berkeley National Laboratory

PP06 Abstracts 51

[email protected]

Vaibhav DondeEnvironmental Energy Technologies DivisionLawrence Berkeley National [email protected]

Chao YangLawrence Berkeley National [email protected]

Juan C. MezaLawrence Berkeley National [email protected]

Ali PinarLawrence Berkeley [email protected]

MS2

Computational Challenges in the Power Industry

Not available at time of publication.

Lawrence JonesTransmission Optimization & Partnering (TOP) ProgramAREVA T&[email protected]

MS2

A Social Intelligence Approach for Identifying theMost Effective Locations for Terrorist Attack in theNational Power Network

This presentation focuses on research related to the iden-tification of critical nodes within the national bulk powersystem. These are nodes that, if disrupted through naturalevents or terrorist action would cause the most widespread,immediate damage. A number of traditional methodswere investigated to identify these nodes: integer-basedoptimization, genetic algorithms, network theory (smallworld), polyhedral dynamics, and cultural-based methods.In addition, a number of modeling approaches were inves-tigated including linear (dc) and non-linear (ac) power flowmodels. The final approach is based on the use of an acpower flow model coupled with an adaptive cultural modeland collective intelligence as a means of characterizing thereliability of bulk power networks. The method involvesa collective of virtual power engineers acting as a cell ofpotential terrorists attempting to attack the bulk powersystem in the most optimal manner. The methodology istechnology independent: it can be applied not only for re-liability analysis of bulk power systems, but also other en-ergy systems or transportation systems or a combinationof these.

David RobinsonSandia National [email protected]

MS2

Using Stochastic Partitioning Algorithms for Con-tingency Identification in Electric Power Systems

Several failure analysis and mitigation tasks for electricpower networks require identification of network partitionswith either large or zero generation-load imbalance. Oneunique feature of these tasks is the need for finding not

only the extremal partition (i.e., the one with the largest orsmallest absolute imbalance) but also other partitions thatare near the extreme. Here, we advance the perspectivethat stochastic partitioning algorithms - including a vari-ant of Karger’s min-cut algorithm and an influence model-based algorithm - are well-suited to search for all partitionswith large or small generation-load imbalance.

Sandip RoyWashington State UniversitySchool of Electrical Engineering and Computer [email protected]

MS3

Code Coupling in the Center for Simulation ofWaves in Magnetohydrodynamics

CSWIM is a DoE project for integrating different physicscodes together for better simulation of magnetically con-fined plasmas. Physics, mathematics, high performancecomputing, and software systems must be addressed in atightly coupled manner. The codes model the same re-gion of space but with different time-scales, discretizations,symmetries, and dimensionality reduction for an inherently7-dimensional problem. This talk is an overview of CSWIMproblems shared by many related fusion energy simulationintegration efforts.

Randall B. BramleyIndiana [email protected]

MS3

Not Available at Time of Publication

Abstract not available at time of publication.

Phil ColellaLawrence Berkeley National [email protected]

MS3

An Overview of The Distributed Coupling Toolkit(DCT): Functionality and Applications

The Distributed Coupling Toolkit (DCT) offers a func-tional interface for flexible and dynamic coupling. Amongkey features of DCT are its mechanisms for handling cou-pling in a distributed manner thus exploiting scalabilityin large parallel systems. In DCT there is no coupler-process(or) or centralized control. Instead coupling is for-mulated in a more natural way to multiphysics models.Here, we present basic DCT functionality, scientific appli-cations deploying DCT, and comparisons with centralizedcoupling approaches.

Leroy A. DrummondComputationa Research DivisionLawrence Berkeley National [email protected]

MS3

The Model Coupling Toolkit and Parallel CouplingInfrastructure

Parallel multiphysics and multiscale models have inherentmutual parallel data exchanges, leading to a parallel cou-pling problem (PCP). The architectural and data process-

52 PP06 Abstracts

ing parts of the PCP lead to a set of requirements we usedto create a performance-portable package–the Model Cou-pling Toolkit–to ease construction of bespoke solutions tothe PCP. We will present a roadmap for development ofa more general parallel coupling infrastructure (PCI) thatwill address more general parallel coupling problems.

Jay W. LarsonMath and Computer Science DivisionArgonne National [email protected]

MS4

A Quantitative Characterization of the Structure ofToolkits Designed Using the Common ComponentArchitecture Paradigm

Component-based software designs, when used for devel-oping codes for scientific simulation, are supposed to en-hance productivity by subsetting software complexity intomodular components and promoting component resuse.We investigate this supposition quantitatively by devisingvarious metrics and evaluating them based on statisticsdrawn from two toolkits (for combustion and computa-tional chemistry simulations), both of which have adoptedthe Common Component Architecture paradigm. We alsoinvestigate how such modularization enables novel tech-niques such as adaptive codes.

Benjamin Allan, Joseph KennySandia National [email protected], [email protected]

MS4

The Cactus Framework: Design, Applications andFuture Directions

The Cactus Code is an open source, modular frameworkfor collaborative high-performance computing. In addi-tion to including different computational and communitytoolkits, Cactus provides a range of abstract interfaces toapplication developers through which a variety of thirdparty libraries and tools, including HDF5, PAPI, SAM-RAI, PETSc, and Globus, can be easily leveraged. In thistalk, we review the design of Cactus, describe how it is usedby different application communities, and discuss ongoingdevelopment plans.

Gabrielle AllenCenter for Computation & TechnologyLouisiana State University, Baton Rouge, LA [email protected]

MS4

An Introduction to Framework-Based Approachesfor Parallel Scientific Computing

Research in parallel scientific computing continues toincrease in complexity, as multi-physics, multi-scale,and multi-institutional projects are becoming widespread.Component-based frameworks aim to address many ofthe software and social challenges that arise in incorpo-rating the combined expertise of diverse groups in suchprojects. This presentation will introduce the main con-cepts of framework-based approaches and will highlight theuse of parallel numerical components in several scientificapplications.

Lois Curfman McInnes

Argonne National [email protected]

MS4

Enabling Cross-Organization Coupling of ClimateModels with the Earth System Modeling Frame-work

Cross-organization coupling of climate models is one effi-cient way of discovering new science, increasing forecastingskills, and validating models. NASA’s Earth-Sun SystemTechnology Office/Computational Technologies Projecthas funded development of the Earth System ModelingFramework (ESMF) to support cross-organization cou-plings within a standardized software environment. Thistalk will discuss the ESMF, which so far has been usedto couple major climate models from NASA, NOAA,NSF/NCAR, DOE/LANL, MIT, and UCLA.

Shujia ZhouSoftware Integration and Visualization OfficeNASA Goddard Space Flight [email protected]

MS5

Software-Directed Disk Power Minimization forScientific Workloads

Energy-consumption has become a very important issuefor high-performance machines that target scientific work-loads. In particular, disk power consumption of data-intensive scientific applications stands as one of the majorproblems for application designers and system maintain-ers. While most of the prior approaches to disk powerminimization focused exclusively on hardware based tech-niques, our work explores the possibility of employing anoptimizing compiler to control disk power. In this work, westudy both data layout organization and code restructur-ing techniques for the single processor and multi-processorcases. We also compare the performance of our approachto pure hardware-based schemes.

Mahmut KandemirDept of Computer Science Engr.The Pennsylvania State [email protected]

Seung Son, Guilin ChenComputer Science & Engg.The Pennsylvania State [email protected], [email protected]

MS5

Maximizing Multinode Performance on MassivelyParallel Computers

This talk concerns recent progress in exploring the fron-tiers of supercomputing including architecture, softwareand tools for massively parallel scientific computing withlow cost, space and power consumption. We will discusshow a large number of low power processors, multiple inte-grated interconnection networks and a novel system soft-ware architecture enables applications from a broad varietyof scientific disciplines to effectively scale to tens of throu-sands of processors.

Jose MoreiraIBM Research

PP06 Abstracts 53

[email protected]

MS5

High-Throughput Low Power Scientific Computing

In this talk we explore how recent advances in high-throughput architectures that concentrate on maximiz-ing the amount of thread-level parallelism at the ex-pense of instruction-level parallelism can achieve very high-performance on scientific applications. This style of archi-tecture, which was originally targeted at commercial serverapplications, can achieve high-performance while maximiz-ing the performance/watt on a wide range of scientific ap-plications. We expect these high-throughput architecturesto dramatically reduce the power requirements of scientificcomputing.

Kunle OlukotunElectrical Engg. and Computer ScienceStanford [email protected]

James LaudonSun [email protected]

MS5

Energy-Aware Architectural Optimizations forFast Sparse Scientific Computations

We utilize current trends in power-aware high-performancearchitecture design to explore the impact of memory sub-system optimizations on sparse scientific codes. We pro-pose a variety of features such as memory and cacheprefetching and load miss prediction to enable highperformance, sparse scientific applications at reducedpower/energy levels. We will discuss the use of cycle-accurate simulations with tools such as SimpleScalar andWattch to characterize performance and power profiles.Our results for an optimized sparse-matrix-vector multipli-cation kernel and two sparse kernels from the NAS bench-mark indicate that in certain cases, our optimizations canreduce application time by 80%, and the energy-time prod-uct by 90%, at system power levels less than or equal tothose in the original configuration.

Mary Jane IrwinCompter Science & EngineerinPennsylvania State [email protected]

Ingyu LeeComputer Science & EngineeringThe Pennsylvania State [email protected]

Konrad MalkowskiPenn State [email protected]

Padma RaghavanComputer Science & EngineeringThe Pennsylvania State [email protected]

MS6

Performance Evaluation of the Columbia Constel-lation

Since its debut in spring of 2002, the Earth Simulator (ES)has attracted worldwide attention – occupying the num-ber one spot on the Top500 for a record two and a halfyears. The ES uses a dramatically different architecturalapproach than most conventional cache-based supercom-puters, utilizing powerful vector processors connected via afast single-stage crossbar. In this talk, we examine ES per-formance on several ultra-scale scientific applications anddemonstrate the tremendous potential of this architecturalparadigm.

Rupak BiswasNASA Ames Research [email protected]

MS6

An Evaluation of MareNostrum Performance

This talk presents an evaluation of the MareNostrum Pow-erPC cluster from the individual components to the finalperformance delivered for real applications. Beyond justreporting the achieved performance metrics, the talk aimsat developing models of the machine and applications. Theresults of microbenchmarks and parameter fits based onsimulations with Dimemas are used to generate the mod-els. Such characterizations prove very useful to properlypredict and understand the observed behaviour and ex-trapolate expectations.

Rosa Badia, Gerard Rodriguez, Jesus LabartaBarcelona Supercomputing [email protected], [email protected], [email protected]

MS6

Leading Computational Methods on the EarthSimulator

Since its debut in spring of 2002, the Earth Simulator (ES)has attracted worldwide attention – occupying the num-ber one spot on the Top500 for a record two and a halfyears. The ES uses a dramatically different architecturalapproach than most conventional cache-based supercom-puters, utilizing powerful vector processors connected via afast single-stage crossbar. In this talk, we examine ES per-formance on several ultra-scale scientific applications anddemonstrate the tremendous potential of this architecturalparadigm.

Leonid Oliker, Jonathan CarterLawrence Berkeley National [email protected], [email protected]

MS6

Scalable Application Performance on BG/L, aTruly Massively Parallel Platform

Unprecedented levels of performance have been achieved onBlueGene/L (BG/L) designed by IBM Research in partner-ship with the Advanced Simulation and Computing (ASC)Program, part of the U.S. Department of Energy’s NationalNuclear Security Administration. The order of magni-tude improvement in performance was enabled by an orderof magnitude increase in processors. This unprecedentedscale has presented unique challenges for applications, from

54 PP06 Abstracts

diverse domains such as molecular dynamics and breadth-first search. This talk will present results for several ap-plications, many of which have achieved record levels ofperformance, as well as the BG/L system architecture.

Bronis de SupinskiLawrence Livermore National [email protected]

MS7

Nonlinear Adaptive Smoothed Aggregation Multi-grid

The increasing demands of large-scale complex solid me-chanics simulations are placing greater emphasis on thechallenges associated with the efficient solution of the setof nonlinear equations. Here, the solution of large sys-tems of equations with material, geometric and contactnonlinearities in a parallel framework is adressed. Insteadof applying widely used Newton- or Newton-Krylov typemethods that involve the derivation of a stiffness matrixand a sequence of linear solves, the presented work de-tails the implementation of a variational nonlinear alge-braic multigrid algorithm based on the full approximationscheme (FAS) applied to solve this set of nonlinear equa-tions [2]. The multigrid hierachy is constructed using anadaptive smoothed aggregation multigrid approach [1],[5]applied to a fine grid Jacobian that will also be used toconstruct nonlinear smoothers on all grids. As the varia-tional evaluation of the nonlinear residual function plays acrucial role for the cost of the overall method, reducing thenumber of residual evaluations by encountering the extraeffort of a (linear) adaptive smoothed aggregation proce-dure can reduce overall solution times significantly. The al-gorithm is implemented within Sandia National Laborato-ries’ freely available parallel ’Trilinos’ linear algebra frame-work[3] and makes use of its smoothed aggregation multi-grid library ’ML’ [4] and its nonlinear solver library ’NOX’.The outline of the algorithm and implementation are giventogether with examples demonstrating the advantages ofthis new approach. Several variants of the algorithm willbe discussed and compared. References [1] Brezina, M. etal., ”Adaptive Smoothed Aggregation (αSA) Multigrid”,SIAM Review, 47, 317-346, 2005. [2] Gee, M.W., Tumi-naro, R.S., ”Nonlinear algebraic multigrid for constrainedsolid mechanics problems using Trilinos”, technical report,Sandia National Laboratories. 2005. [3] Heroux, M.A.,Willenbring, J.M., ”Trilinos Users Guide”, Sandia ReportSAND2003-2952, 2003. [4] Sala, M., Hu, J.J., Tuminaro,R.S., ”ML 4.0 Smoothed Aggregation User’s Guide”, San-dia Report SAND2004-2195, 2004. [5] Vanek, P., Mandel,J., Brezina, M., ”Algebraic Multigrid by Smoothed Aggre-gation for Second and Fourth Order Elliptic Problems”,Computing, v. 56, p. 179-196, 1996.

Ray S. Tuminaro, Michael W. Gee, Kendall H. PiersonSandia National [email protected], [email protected],[email protected]

MS7

Smoothed Aggregation Based on Energy Optimiza-tion and Aggressive Coarsening

Algebraic multigrid (AMG) based on smoothed aggrega-tion (SA) was originally developed around the central ideaof minimizing the energy of interpolation operators thatalso interpolate certain null space vectors exactly [1]. Man-del, Brezina, and Vanek later proposed a more general

method that seeks to minimize the energy of the inter-polation operators by explicitly solving a constrained op-timization problem, subject to a specified nonzero patternand certain null space interpolation constraints [1]. In thistalk we will begin with an overview of the SA approach toAMG. We will then explore extensions of the energy op-timization method to SA with aggressive coarsening (i.e.,larger aggregates) and explicit detection of anisotropies.We will end with parallel numerical results demonstrat-ing the effectiveness of the AMG preconditioning. [1] P.

Vanek, J. Mandel, and M. Brezina, ”Algebraic multigridbased on smoothed aggregation for second and fourth or-der problems.” Computing, 1996 (56), p. 179-196. [2] J.Mandel, M. Brezina, and P. Vanek, ”Energy Optimizationof Algebraic Multigrid Bases”. Computing, 1999 (62), p.205-228.

Jonathan J. HuSandia National LaboratoriesLivermore, CA [email protected]

Alejandro CantareroUCLA Mathematics DepartmentLos Angeles, [email protected]

MS7

A Parallel Implementation of a Fast Adaptive Com-posite Mesh Solver (FAC) in HYPRE

The FAC Solver is an extension of the multigrid methodfor solving AMR problems. Parallelization of this solverinvolves two types of communication, between the AMRhierarchy and within one AMR level. Since our FAC algo-rithm involves algebraically constructing the coarse patchoperators, both of these types of communication must beconsidered in the setup and solve phases. This talk willdiscuss the parallel implementation of our FAC solver inthe Semi-Structured Interface of HYPRE.

Barry LeeCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

MS7

A new framework to define smoothed aggregationpreconditioners for the parallel solution of non-symmetric linear systems

A new variant of the smoothed aggregation multigridmethod suitable for solving non-symmetric linear systemsis presented. We extend the classical smoothed aggrega-tion algorithm for symmetric linear systems, which firstdefines a tentative prolongator operator, applies an oper-ator of type I − ωA to it to produce the prolongator op-erator, then takes the transpose of this final operator toconstruct the restriction operator. We introduce a newframework, that can be used for both symmetric and non-symmetric problems, to compute the damping parameterω. The framework also allows the construction of Petrov-Galerkin coarse level corrections. Numerical results will bepresented, to show the validity of the proposed approachon highly non-symmetric systems, obtained with finite dif-ferences and finite elements methods.

Marzio Sala

PP06 Abstracts 55

ETHZ Computational [email protected]

Ray S. TuminaroSandia National LaboratoriesComputational Mathematics and [email protected]

MS8

Object-Oriented Generic Programming for Ab-stract Numerical Algorithms via Thyra

Object-oriented generic programming techniques allowsoftware developers to express algorithms using abstractinterfaces where details of how work is accomplished arenot prescribed a priori. Thyra defines the interfaces neededfor a large class of abstract numerical algorithms on par-allel computers. By using Thyra to access basic linear al-gebra objects and solvers, users can access broad sets ofinterchangeable solver components. In this talk we presentThyra and how it can be used for high-performance parallelnumerical software.

Roscoe A. BartlettSandia National [email protected]

MS8

The Trilinos Package Architecture and Infrastruc-ture

Trilinos is a collection of independent, interoperable andreusable software libraries for constructing and solving sys-tems of linear, nonlinear, eigen, and time-dependent equa-tions that arise in complex, high-fidelity multi-physics ap-plications on serial and parallel computers. In this pre-sentation, we discuss the Trilinos package architecture andinfrastructure, showing how Trilinos impacts software de-velopment efficiency and quality. Finally, we discuss howpackage interoperability is accomplished and its impact onsolver capabilities.

Michael A. HerouxSandia National [email protected]

MS8

PyTrilinos: A Parallel Python Interface to Trilinos

PyTrilinos is a collection of Python modules to use thesolver capabilities in Trilinos. PyTrilinos is attractive toapplications written in Python, and to others who want arapid prototyping environment. Its performance is remark-ably good and it executes in parallel via MPI. In this pre-sentation we give an overview of PyTrilinos and its designand use. We also compare it to using Trilinos via nativeC++, illustrating ease of programming and performanceresults.

William F. SpotzSandia National [email protected]

MS8

Anatomy of a Package: Anasazi

Anasazi is an extensible and interoperable framework forlarge-scale eigenvalue algorithms, designed to provide a

generic interface to a collection of algorithms for solvinglarge-scale eigenvalue problems. Anasazi eigensolvers enlistgeneric programming techniques to give the user the abil-ity to leveraging any existing software investment. Thistalk will discuss the design of Anasazi, from top to bot-tom, illustrating its capabilities for interoperability withinthe Trilinos framework and beyond.

Heidi ThornquistSandia National [email protected]

MS9

A Hybrid Software Framework for ParallelTsunami Simulations

Efficient tsunami simulations require different mathemati-cal models, numerical algorithms, and local meshes in dif-ferent regions of an ocean. We therefore present a hy-brid framework based on subdomains, where a subdomainsolver can vary from a Fortran FDM code working on auniform local mesh to a C++ FEM code on an unstruc-tured mesh. Serial wave codes are thus incorporated into aparallel hybrid tsunami simulator. Preliminary simulationresults of the Indian Ocean Tsunami will also be reported.

Xing CaiSimula Research Laboratory1325 Lysaker, [email protected]

MS9

Component-Based Quantum Chemistry: Examin-ing Coarse and Fine-Grained Scientific ComponentApplications

Component-based molecular structure optimization, inte-gral evaluation, and numerical Hessian evaluation havebeen implemented by the quantum chemistry workinggroup associated with the Common Component Architec-ture. Serving as examples of both coarse and fine-grainedcomponent applications, the development of these capabil-ities will be described in an effort to evaluate the advan-tages and disadvantages of component-based developmentin typical scientific applications. Topics of focus will in-clude components and frameworks from a scientific pro-grammer’s perspective, development costs, benefits of in-teroperabilty, and performance.

Joseph KennySandia National [email protected]

MS9

Requirements for an End-to-End Solution for theCenter for Plasma Edge Simulation Fusion Simula-tion Project

The Center for Plasma Edge Simulation, CPES, is one ofthe two funded SciDAC Fusion Simulation Projects. Amajor emphasis in this project is in coupling componentsof XGC, a massively parallel Monte Carlo particle-in-cell(PIC) code designed to model plasma in the edge regionof tokamaks, to the MHD code M3D. Working with theScalable Data Management center, we are creating new en-hancements to a workflow system (called Kepler) to createa data model for this project that allows for fast NXM datastreaming, collaborative data monitoring, and distributed

56 PP06 Abstracts

storage.

Scott KlaskyOak Ridge National LaboratoryOak Ridge, [email protected]

MS9

The Systems Biology Problem: A Component-based Approach to Computational Sciences

The major challenge for computational science in biology isdealing with disparate forms of data that are increasinglymassive in volume and diverse in format. A componentarchitecture approach is increasingly being used to addresscomplex query problems in the domain of biological sci-ence by coupling applications and software into powerfulnew technologies. I will discuss the current advances inapplications of computational science for biological chal-lenges and provide a view toward where computing needsto evolve to address future directions.

George MichaelsPacific Northwest National [email protected]

MS10

Application Performance on the TeraGrid Systems


Robert PenningtonNational Center for Supercomputing [email protected]

MS10

Preliminary Per-formance Evaluation of the Power5-based PurpleSystem


Tom SpelceLawrence Livermore National [email protected]

MS10

Performance Evaluation of the Cray XT3

Oak Ridge National Laboratory recently received deliveryof a 25 TF Cray XT3. The XT3 is Cray’s third-generationmassively parallel processing system. The system buildson a single processor node-the AMD Opteron-and uses acustom chip-called SeaStar-to provide interprocessor com-munication. In addition, the system uses a lightweight op-erating system on the compute nodes. This presentationdescribes our initial experiences with the system, includ-ing micro-benchmark, kernel, and application benchmarkresults. In particular, we provide performance results forimportant Department of Energy applications areas includ-ing climate modeling and fusion simulation.

Jeffrey S. Vetter, Patrick H. Worley, Mark FaheyOak Ridge National [email protected], [email protected], [email protected]

Sadaf Alam, Tom Dunigan, Philip RothORNL

[email protected], [email protected], [email protected]

MS10

Performance Characterization of the Cray X1E

The Cray X1E has a number of novel architectural featuresthat affect performance. While utilizing a vector proces-sor, it also has a vector cache and an additional level offine grain parallelism called streaming. It supports lowlatency distributed shared memory programming modelssuch as Co-Array Fortran, but has relatively high MPI la-tency. This talk presents an overview of the architectureand describes performance data that illustrate the promiseand peculiarities of the system.

Patrick H. WorleyOak Ridge National [email protected]

MS11

A Framework for Efficient Multigrid on High Per-formance Architectures

The hierarchical hybrid Grids (HHG) framework attemptsto remove limitations on the size of problem that can besolved using a finite element discretization of a partial dif-ferential equation (PDE) by using a process of regularrefinement, of an unstructured input grid, to generate anested hierarchy of patch-wise structured grids that is suit-able for use with geometric multigrid. The regularity of theresulting grids may be exploited in such a way that it isno longer necessary to explicitly assemble the global dis-cretization matrix. This drastically reduces the amount ofmemory required for the discretization, thus allowing for amuch larger problem to be solved. Here we present a briefdescription of the HHG framework, detailing the principlesthat led to solving a finite element system with 1.7x1010unknowns, with an overall performance of 0.96 TFLOP/s.

Ben BergenLos Alamos National [email protected]

MS11

Scalable Algebraic Multigrid on 3500 Processors

A parallel algebraic multigrid linear solver method is pre-sented which is scalable to thousands of processors on sig-nificant classes of two- and three-dimensional problems.The algorithm is entirely algebraic and does not requireprior information on the physical problem. Scalability isachieved through the use of an innovative parallel coars-ening technique in addition to aggressive coarsening andmultipass interpolation techniques. Details of this algo-rithm are presented together with numerical results on upto several thousand processors.

Wayne JoubertXiylon [email protected]

MS11

Parallel Algebraic Methods for Discretization andSolution of Finite Element Problems

While classical AMG works well on many finite elementmatrices, complicated problems such as higher-order ele-ments, elasticity and electromagnetics require additional

PP06 Abstracts 57

information, that often can be extracted only from the lo-cal stiffness matrices. We will describe a parallel element-based algebraic multigrid method which uses local traceminimization to construct high-quality vector-preservinginterpolation operators. We will also discuss the assem-bling of the stiffness matrix, the parallel element agglom-eration procedure and some numerical results.

Panayot VassilevskiLarence Livermore National Laboratorypanayot.llnl.gov

Tzanio KolevCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

MS11

The Performance of Parallel AMG Methods forSystems of PDEs on BlueGene/L

Algebraic multigrid (AMG) is a very efficient, scalable al-gorithm for solving large linear systems on unstructuredgrids. When solving linear systems derived from systemsof partial differential equations (PDEs) often a different ap-proach is required than for those derived from scalar PDEs.We will give a brief overview of AMG and present severalparallel implementations of various approaches for solvingsystems of PDEs with AMG. Advantages and disadvan-tages of these approaches are discussed, and numerical re-sults on BlueGene/L are presented.

Ulrike Meier YangCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

Allison H. BakerCenter For Applied Scientific ComputingLawrence Livermore National [email protected]

MS12

On the Use of Inexact Subdomain Solvers forBDDC Algorithms

The BDDC preconditioner is shown to be equivalent to adomain decomposition preconditioner written in terms ofsolving a partially subassembled finite element model. Theconnection of another with the FETI–DP algorithm with alumped preconditioner is also considered. Multigrid meth-ods are used to replace exact solutions in the precondition-ers and under certain assumptions, it can be establishedthat the iteration count essentially remains the same aswhen using exact solvers.

Jing LiDepartment of MathematicsKent State [email protected]

Olof WidlundCourant [email protected]

MS12

FETI-DP for Higher Order Methods

Nonoverlapping domain decomposition methods of theFETI-DP and BDDC type have been widely in use. Wepresent sequential and parallel performance results ofFETI-DP and BDDC algorithms including an inexact vari-ant of the FETI-DP method for spectral elements. This isjoined work with Axel Klawonn, Essen and Luca Pavarino,Milan.

Oliver RheinbachUniversity of [email protected]

MS12

BDDC Algorithms for Flow in Porous Media

BDDC algorithms are developed for the linear systems aris-ing from flow in porous media discretized with mixed andhybrid finite elements. Our methods are proven to be scal-able and the condition numbers of the operators with ourBDDC preconditioners grow only polylogarithmically withthe size of the subdomain problems.

Xuemin TuCourant [email protected]

MS12

An Introduction to FETI-DP and BDDC Algo-rithms and Some Recent Results

The FETI-DP and BDDC family of algorithms in manyways represent the state of the art of iterative substruc-turing methods, i.e., the domain decomposition algorithmswhich work with non-overlapping subdomains. The meth-ods will be introduced and their close connections will beexplored; recent theory suggests that the rates of conver-gence are very similar and that other considerations comeinto play when choosing to use FETI-DP or BDDC. Sev-eral extensions to new families of problems will also bementioned


MS13

Operating System Interference Effects At ExtremeScale

For parallel applications which are highly synchronous, theperformance of collective operations such as barriers andreductions are important considerations to scalability. Thecumulative nature of operating system interference effectsis particularly troublesome for large-scale parallel appli-cations. This presentation describes recent investigationsundertaken by Lawrence Livermore National Laboratory(LLNL) and IBM to understand and quantify operatingsystem interference effects. Results on LLNL’s 12,000+processor ASC Purple supercomputer are presented.

Terry Jones

NERSC/[email protected]

58 PP06 Abstracts

MS13

Handling OS Interference Via Migratable Message-Driven Objects

We examine three sources of interference from our expe-rience with applications: those due to handling of sub-normal floating point numbers on many machine/OS com-binations, those due to OS daemons, and those due to vari-able communication latencies and overheads. Such inter-ference can damage performance significantly. We showhow Charm++ and Adaptive MPI handle this challengeeffectively and automatically in many cases, with its asyn-chronous character, multiple objects per processor, auto-matic latency tolerance and dynamic load balancing.

Gengbin ZhengUniversity of Illinois at [email protected]

Issaac [email protected]

Laxmikant KaleUniversity of Illinois at [email protected]

MS13

Understanding the Causes of Performance Vari-ability in HPC Workloads

Variability of parallel application performance has broadimplications for how much useful work can be producedby a particular HPC system. This work examines perfor-mance variations in parallel applications on time scales ofyears to microseconds, with a focus on understanding thecauses of performance variability in multi-user productionenvironments. Performance variability is often the result ofcontention and reducing variability nearly always equateswith increased performance. An evaluation of actions takento achieve both of these goals is presented.

Bill Kramer, David E. SkinnerNERSC/[email protected], [email protected]

MS13

Exploring Performance Sensitivity of DistributedMemory Parallel Programs to System Interference

The degradation of parallel program performance due tooperating system interference is recognized as an impor-tant, but tunable, factor in choosing and configuring par-allel platforms. This talk discusses a methodology and pro-totype tool for studying the sensitivity of message passingcode performance to perturbations due to operating systeminterference and variability in interconnect latency. Thiswork uses messaging traces from MPI-based programs andmicrobenchmark derived system parameters to explore theperformance characteristics of real codes.

Matthew SottileLos Alamos National [email protected]

MS14

Performance Annotations on the BlueGene/L

This talk introduces the idea of using source code an-notations for addressing performance issues by permit-ting (but not requiring) access by the programmer tolow-level domain-specific performance optimizations thatare not normally performed by compilers. Performanceannotations are implemented through source transforma-tion, without directly extending the programming languagethus enabling rapid development of optimized code fordomain-specific abstractions. We will demonstrate perfor-mance improvement achieved through annotation-guidedcode generation on the Blue Gene.

William D. GroppArgonne National [email protected]

Boyana NorrisArgonne National LaboratoryMathematics and Computer Science [email protected]

Dinesh K. KaushikArgonne National LaboratoryD-247, Bldg. 221, MCS [email protected]

MS14

Performance and Memory Evaluation using theTAU Performance System

High-End applications and systems are evolving towardsmore sophisticated modes of operation, higher levels of ab-straction, and larger scales of execution. This talk will dis-cuss our current and future work on improving the utility ofparallel performance tools and evolving their capabilities tomeet high-end application requirements for the IBM Blue-Gene/L platform using the TAU performance system. Ad-vances in instrumentation techniques, development of per-formance metrics for measurement of application perfor-mance using hardware performance counters (PAPI), mea-surement of available heap memory and headroom avail-able for a process to grow, and visualization techniquesfor analysis of performance data from tens of thousands ofprocessors will be presented.

Sameer ShendeDept. of Computer and Information ScienceUniversity of [email protected]

MS14

Performance Measurement and Analysis on Blue-Gene/L Using SvPablo

SvPablo is a graphical environment for source code instru-mentation and performance analysis. It generates statisti-cal performance data including software performance dataand hardware counter metrics. The GUI provides a conve-nient means for high-level bottleneck detection, multi-leveldetailed analysis, and scalability and load balancing stud-ies. We will present SvPablo toolkit and the performanceanalysis results for a scientific application on BG/L, andalso the insights on the performance comparisons of BG/Land other HPC systems.

Ying Zhang

PP06 Abstracts 59

University of North [email protected]

MS15

Dynamic Computations in Large-Scale Graphs

Graph abstractions are used in many computationally chal-lenging science and engineering problems. For instance, theminimum spanning tree (MST) problem finds a spanningtree of a connected graph G with the minimum sum ofedge weights. MST is one of the most studied combinato-rial problems with practical applications in VLSI layout,wireless communication, and distributed networks, recentproblems in biology and medicine such has cancer detec-tion, medical imaging, and proteomics, and national secu-rity and bioterrorism such as detecting the spread of tox-ins through populations in the case of biological/chemicalwarfare, and is often a key step in other graph problems.Graph abstractions are also used in data mining, determin-ing gene function, clustering in semantic webs, and securityapplications. For example, studies have shown that certainactivities are often suspicious not because of the charac-teristics of a single actor, but because of the interactionsamong a group of actors. Interactions are modeled througha graph abstraction where the entities are represented byvertices, and their interactions are the directed edges in thegraph.

David A. BaderGeorgia Institute of [email protected]

MS15

Dynamic Data Driven Finite Element Modelling ofBrain Shape Deformation During Neurosurgery

Key challenges for neurosurgeons during tumor resectionare to remove as much tumor tissue as possible, minimizeremoval of healthy tissue and know when to stop resection.These rise due to the intra-operative brain shape deforma-tion that happens as a result of surgical procedures. Weare implementing an intra-operative MRI data driven sim-ulation process that will capture the intra-operative brainshape deformation in real time. This will help to improveintra-operative surgical navigation.

Petr KryslStructural Engineering DeptUniv of California San [email protected]

Kim BaldridgeSan Diego Supercomputer CenterUniv of California San [email protected]

Amitava MajumdarSan Diego Supercomputer [email protected]

DongJu ChoiSan Diego Supercomputer CenterUniv of California San [email protected]

Simon Warfield, Neculai ArchipRadiology Dept, Brigham and Women’s HospitalHarvard Medical School

[email protected], [email protected]

MS15

Tacking Obesity in Children

This talk will focus on work in progress in datamining andparametric modeling on medical data in a high end com-puting environment to study and understand the currentstudent health screening in children and reporting prac-tice. Plans for developing a tracking system to assimilatethe screened data, analyze them and evaluate the efficiencyand utility of the system based on feedback and develop-ment of guidelines for the future implementation of thesystem, including potential economies of scale will be dis-cussed.

Weimo ZhuUniversity of Illinois at [email protected]

Maryann MasonNorthwestern [email protected]

Radha NandkumarNational Center for Supercomputing [email protected]

MS15

Data-Driven Parallelization in Multi-Scale Appli-cations

Conventional parallelization typically decomposes the statespace. It scales well when the computational effort arisesfrom simulating a large state-space, but not when the ef-fort arises from simulating for long times. We show how wecan parallelize the time domain instead, using data fromrelated simulations. This enables parallelization to scaleup to three orders of magnitude greater numbers of pro-cessors than conventional parallelization, on important ap-plications involving multiple time-scales in nanotechnologyand computational biology.

Ashok Srinivasan, Yanan YuDepartment of Computer Science,Florida State [email protected], [email protected]

MS16

A Graph Infrastructure for Multithreaded Archi-tectures

The next generation of supercomputers will include mas-sively multithreaded architectures specifically designed totolerate latency. This is of direct interest to graph algo-rithm designers since these algorithms typically have poormemory locality. Sandia is developing a small softwareinfrastructure to support algorithm development on theCray MTA-2/Eldorado multithreaded machines. We willdescribe this infrastructure and some algorithmic idiomssuggested by multithreaded architectures. We will alsopresent some performance results for a few core graph al-gorithms

Jonathan BerrySandia National [email protected]

60 PP06 Abstracts

MS16

The Parallel Boost Graph Library

The Parallel Boost Graph Library is a research plat-form for the development, evaluation, and disseminationof distributed-memory graph algorithms. The core of theParallel BGL is a set of concepts that state the abstractrequirements on graphs, auxiliary data structures, and theunderlying communication layers. By implementing algo-rithms using only these requirements, any component ofthe Parallel BGL can be replaced with another that im-plements the same, allowing independent experimentationand optimization of library components.

Douglas Gregor, Andrew LumsdaineOpen Systems LaboratoryIndiana [email protected], [email protected]

MS16

Computing Approximate Matchings in Parallel

Computing a (possibly weighted) matching in a graph is animportant task in many scientific applications. Althoughthere exist exact polynomial time algorithms for the match-ing problem these algorithms are often to time consumingto used in practice. Lately several linear time approxima-tion algorithms with bounds as good as 2/3 within the op-timal matching have been proposed. In this talk we presentresults from parallelizing these algorithms.

Fredrik ManneUniversity of Bergen, [email protected]

MS16

Scalable Graph-Theoretical Approaches to Biolog-ical Network Analysis

Many biological objects are naturally represented asgraphs. Examples include metabolic, signaling and reg-ulatory pathways, protein interaction networks, and chem-ical compound graphs. The elucidation of genome-scalestructure-function relationships between these objects ne-cessitates the development of more efficient and effectivemethods for the comparison of their underlying graphs.We present scalable graph-theoretical approaches to thisproblem including graph matching, maximum clique find-ing and maximal clique enumeration. Performance bench-marks on advanced hardware architectures will be also pre-sented.

Yun Zhang, Michael LangstonComputer Science DepartmentUniversity of Tennessee, [email protected], [email protected]

Nagiza F. SamatovaOak Ridge National LaboratoryComputer Science & Mathematics [email protected]

Faisal Abu-KhzamDivision of Computer Science and MathematicsLebanese American University, Beirut,[email protected]

Henry Sutersepartment of Mathematics and Computer Science

Carson-Newman [email protected]

MS17

Parallel Tet Mesh Generation: An Overview

Parallel mesh generation is a relatively new research areabetween the boundaries of two scientific computing disci-plines: computational geometry and parallel computing.In this talk we present a survey of parallel unstructuredmesh generation methods. Parallel mesh generation meth-ods decompose the original mesh generation problem intosmaller subproblems which are meshed in parallel. We or-ganize the parallel mesh generation methods in terms oftwo basic attributes: (1) the sequential technique used formeshing the individual subproblems and (2) the degree ofcoupling between the subproblems. This survey shows thatwithout compromising in the stability of parallel mesh gen-eration methods it is possible to develop parallel meshingsoftware using off-the-shelf sequential meshing codes. How-ever, more research is required for the efficient use of thestate-of-the-art codes which can scale from emerging chipmultiprocessors (CMPs) to clusters built from CMPs.

Nikos ChrisochoidesComputer ScienceCollege of William and [email protected]

MS17

Mesh Generation for Non-Regid Registration

Adaptive mesh generation and refinement impose uncom-mon requirements on runtime support substrate. Conven-tional communication libraries like MPI are not well-suitedfor development of mesh generation codes. In this talk wewill present the communication library, Clam, which aimsthe applications of interest, and describe our experienceusing this code on CoWs and on TeraGrid.

Andriy FedorovHarvard Medical [email protected]

MS17

Challenges in Parallel Meshing of Evolving Geome-tries

The Center for Simulation of Advanced Rockets (CSAR)conducts research on whole-system simulation of solid pro-pellant rockets on highly-parallel computers. Importantsubsystems include solid mechanics of the propellant andfluid dynamics of the interior flow and exhaust plume.As the geometry of the solid propellant and interior flowevolves, increasingly aggressive mesh adaptation is re-quired, including mesh smoothing, local mesh repair, andin some cases global remeshing. Mesh smoothing relocatesvertices to maintain mesh quality without changing ele-ment connectivity. The smoothing procedure must be effi-cient enough to be invoked at every few time steps by everyprocessor and must respect displacement limits imposed bythe ALE (Arbitrary Lagrangian Eulerian) formulation ofour physics solvers. Local mesh repair is applied for se-vere geometric deformation, eliminating poor elements inthe problematic area while preserving good elements else-where. Parallelization of local repair is challenging becausethe area in need of repair can be shared by an arbitrarynumber of processors. Global remeshing of the entire do-

PP06 Abstracts 61

main has the advantage of being independent of deteriora-tion of the old mesh, but has the disadvantage of requiringglobal solution transfer, which entails an expensive geo-metric search. For a mesh of moderate size, sequentialremeshing may suffice, but for a very large one, parallelmesh generation is necessary. For each of these three levelsof mesh improvement, our implementations involve someuse of third-party software. We will address software inte-gration issues as well.

Damrong GuoyUniversity of Illinois at [email protected]

MS17

Parallel 3D mesh generation Algorithms Applied toComplex Biology Problems Using Global Arrays


Harold E. TreasePacific Northwest National [email protected]

MS18

An Approximate BDDC Preconditioner with Ap-plications

BDDC (Balancing Domain Decomposition by Constraints)preconditioners require direct solutions of two linear sys-tems for each substructure and one linear system for aglobal coarse problem. The computations and memoryneeded for these solutions can be prohibitive if any one sys-tem is too large. An approach is presented for addressingthis issue based on approximating the direct solutions bypreconditioners. Example parallel computations demon-strate the usefulness of the approach in terms of reducedmemory requirements and decreased computing times.

Clark R. DohrmannSandia National [email protected]

MS18

A Parallel FETI-DP Mortar Method for Secondand Fourth Order Elliptic PDEs

We discuss a parallel implementation and performance re-sults on up to 1000 processors of a FETI-DP methodfor second order elliptic partial differential equations withhighly discontinuous coefficients on subdomains with non-matching triangulations. We also present preliminary re-sults on the extension of this method for fourth order plateproblems with mortar interface conditions for completeand reduced Hsieh-Clough-Tocher finite element discretiza-tions.

Nina N. DokevaUniversity of Southern [email protected]

MS18

Parallel FETI Based Algorithms for Numerical So-lution of Variational Inequalities

We shall first briefly review our theoretical results con-cerning optimal algorithms for the solution of boundand/or equality constrained quadratic programming prob-

lems. Then we shall review our results on scalability ofour FETI-DP and FETI based algorithms for numericalsolution of model coercive and semicoercive variational in-equalities, including mortar contact discretization. Finallywe shall extend the scalability results to the elasticity in-cluding 2D problems with given (Tresca) friction and givenumerical results with parallel implementation.

Zdenenk DostalOstrava, The Czech [email protected]

D. HorakTechnical University Ostrava, The Czech [email protected]

D. StefanicaBaruch Collegenot available

MS18

BDDC and FETI-DP for Mortar Finite ElementMethods

A BDDC (balancing domain decomposition by constraints)algorithm is developed for elliptic problems with mortardiscretizations for geometrically non-conforming partitionsin both two and three spatial dimensions. The coarse com-ponent of the preconditioner is defined in terms of one mor-tar constraint for each edge/face which is an intersectionof the boundaries of a pair of subdomains. A conditionnumber bound of the form Cmaxi{(1 + log(Hi/hi))

3} isestablished. In geometrically conforming cases, the boundcan be improved to Cmaxi{(1 + log(Hi/hi))

2}. This esti-mate is also valid in the geometrically non-conforming caseunder an additional assumption on the ratio of mesh sizesand jumps of the coefficients. In addition, this BDDC algo-rithm is closely related to the FETI–DP algorithm with theNeumann-Dirichlet preconditioner. It is proved that theseBDDC and FETI–DP algorithms share the same spectraexcept possibly for an eigenvalue equal to 1. This facthas been also established in BDDC algorithms applied toconforming finite element approximation. Numerical re-sults present that the BDDC algorithm has 1 as its min-imum eigenvalue and the FETI–DP algorithms has all ofits eigenvalues grater than 1.

Hyea Hyun KimCourant Institute, [email protected]


MS18

Approximate FETI-DP preconditionng for Nonlin-ear Solid Mechanics

The FETI algorithms are numerically scalable iterative do-main decomposition methods for solving equations arisingfrom the Finite Element discretization of second- or fourth-order elasticity problems. The one level FETI methodequipped with the Dirichlet preconditioner was shown tobe numerically scalable for second-order elasticity prob-lems while the two level FETI method was designed to benumerically scalable for fourth-order elasticity problems.Most recently, the dual-primal FETI method (FETI-DP)has emerged as an attractive alternative to the first FETI

62 PP06 Abstracts

algorithms. Implicit nonlinear solid mechanics applicationshave a long history at Sandia. Adagio is the latest implicitsolid mechanics code that uses FETI-DP as a precondi-tioner within the framework of a nonlinear conjugate gra-dient solver. I will report on serial, parallel, approximate,and adaptive algorithmic extensions to FETI-DP that haveenabled a larger class of problems to be solved.

Kendall H. PiersonSandia National [email protected]

MS19

Using Parallel Coupling to Evolve Multi-PhysicsSimulations

There is little disagreement that component architecture isa revolutionary step away from traditional software prac-tices and it is also well-accepted that most software evolvesfrom other software. On the face of it, the transition fromdisparate to integrated multi-physics component-orientedapplications requires a ”leap”. This presentation will fo-cus on using parallel coupling between erstwhile sepa-rate applications to evolve a successively more integratedcomponent-oriented code. Examples from practice will becited.

Rob ArmstrongSandia National [email protected]

MS19

Comparison of Approaches to Using the Earth Sys-tem Modeling Framework

The ESMF is emerging as an open standard in the cli-mate and weather domain. Its component-based, hierar-chical architecture is being adopted by a diverse groupof modelers, who have experimented with ways to orga-nize their applications using the framework. In this talkwe examine level of componentization, modes of executionand sequencing, and use of framework services and datastructures in ESMF-based applications from UCLA, MIT,NASA, GFDL, and elsewhere.

Atanas TrayanovNASA [email protected]

C. Roberto [email protected]

Max Suarez, Arlindo da SilvaNASA [email protected], [email protected]

Chris HillEarth, Atmospheric and Physical [email protected]

V. BalajiNOAA [email protected]

Nancy CollinsNational Center for Atmospheric Research

Boulder, [email protected]

Cecelia DeLucaComputing Science SectionNational Center for Atmospheric [email protected]

MS19

Cpl6: Software for Building Parallel Coupled Cli-mate Models

Cpl6 is the coupling software for the Community ClimateSystem Model, version 3. Cpl6 provides a reusable mainprogram and four spokes to which atmosphere, ocean, seaice and land models are attached to create a coupled cli-mate model. Coupling is achieved by placing calls to thecpl6 library within the component models. All cpl6 datatransfer and interpolation routines are fully parallel. Us-ing cpl6, user’s have successfully added new models to theCCSM3 system.

Tony Craig, Brian KauffmanNational Center for Atmospheric [email protected], [email protected]

Jay W. LarsonMath and Computer Science DivisionArgonne National [email protected]

Robert JacobMathematics and Computer Science DivisionArgonne National [email protected]

MS19

Component-Based Multi-Physics Simulations ofFires and Explosions

We describe the challenges in creating detailed simulationsof fires and explosions at the University of Utah. We dis-cuss how we address those challenges using a component-based infrastructure that it designed for integrated multi-physics applications, including highly parallel simulationsof complex fluid-structure interaction problems. This ap-proach has been used to simulate the response of energeticdevices subject to harsh environments such as hydrocarbonpool fires.

Steven G. ParkerScientific Computing and Imaging InstituteUniversity of [email protected]

MS19

Flexible Control over Inter-Component DataTransfers for Multiphysics Simulations

Allowing loose coupling between the components of com-plex multiphysics simulations has many advantages, in-cluding being able to easily incorporate new applicationcomponents and flexible specification of how the compo-nents are connected to transfer data between them. Tofacilitate efficient data transfers between application com-ponents, the InterComm environment enables specifyingwhen data transfers are desired outside the individual com-ponents. The talk focuses on the design and implementa-

PP06 Abstracts 63

tion of efficient methods for deciding at runtime when datatransfers should occur.

Alan SussmanUniversity of MarylandDepartment of Computer [email protected]

MS20

Whole Genome, Physics-based Sequence Align-ment for Pathogen Signature Design

Two computational obstacles currently impede hybridiza-tion based pathogen detection assay design: the rapidgrowth in the number of DNA sequences in GenBank andthe reliance on heuristic metrics of sequence similaritypoorly suited to DNA hybridization. We have developed ahigh performance computational pipeline incorporating atemperature-dependent, free energy based measure of se-quence similarity and assay specific physical constraints.DNA signatures have been designed for both national se-curity and food safety applications.

Wu Feng, Murray Wolinsky, Jason D. GansLos Alamos National [email protected], [email protected], [email protected]

MS20

ScalaBLAST: Scalable High-Performance SequenceAlignment

ScalaBLAST is a high-performance BLAST engine builtusing the Global Array toolkit. This shared memory in-terface makes it possible to efficiently operate on large se-quence databases on distributed and shared memory archi-tectures. ScalaBLAST is implemented using a combinationof query scheduling and database sharing. This hybrid ap-proach has enabled large-scale sequence alignments to beperformed at the whole-genome scale, opening the doorfor informatics-driven science discovery using data-driventechniques such as genome context mining.

Jarek NieplochaPacific Northwest National LaboratoryMS-IN: K1-85, P.O. Box 999, ichland, WA [email protected]

Christopher S. OehmenPacific Northwest National [email protected]

MS20

PARALIGN: Rapid Sequence Similarity SearchesUtilizing Several Levels of Parallelism

PARALIGN is a rapid and sensitive tool for sequence simi-larity search tool for identification of distantly related genesand proteins in sequence databases. Two methods areimplemented: accelerated Smith-Waterman and ParAlign.Both methods utilize parallelism at several levels. Single-Instruction Multiple-Data (SIMD) technology is used togreatly speed up sequence comparison at the sub-sequencelevel. This technology, also known as multimedia tech-nology, is available in most modern processors, but rarelyused by other bioinformatics software. At a higher level,the database is split and searched in parallel by multiplethreads and multiple nodes communicating using MPI. A

public search service is available at www.paralign.org

Torbjørn RognesUniversity of OsloCenter for Molecular Biology and [email protected]

Jon MyrsethCenter for Molecular Biology and Neuroscience (CMBN)Rikshospitalet-Radiumhospitalet and University of [email protected]

Sten Morten AndersenSencel Bioinformatics ASPO Box 180 Vinderen, NO-0319 Oslo, [email protected]

Jon LærdahlCenter for molecular biology and neuroscience (CMBN)Institute of Medical Microbiology, and University of [email protected]

Per Eystein SæboCenter for Molecular Biology and Neuroscience (CMBN)Institute of Medical Microbiology, and University of [email protected]

MS20

Efficient Data Handling in Comparative GenomeAnalysis Applications

Comparative genome analysis is a routine but computa-tionally demanding task in biology. While paralleliza-tion of an analysis engine addresses this computation-intensive problem, data-intensive initial preparation andresult merging tasks become a performance bottleneck withthe exponential growth in sequence database sizes. Wepresent a set of techniques for efficient and flexible datahandling in parallel sequence analysis applications. Ap-plying these techniques to mpiBLAST leads to an orderof magnitude overall performance and scalability improve-ment.

Nagiza F. SamatovaOak Ridge National LaboratoryComputer Science & Mathematics [email protected]

Heshan Lin, Xiaosong MaDepartment of Computer ScienceNorth Carolina State [email protected], [email protected]

Wu-chun FengAdvanced Computing LaboratoryLos Alamos National [email protected]

Al GeistOak Ridge National LaboratoryComputer Science and Mathematics [email protected]

MS21

Combinatorial Preprocessing of Matrices for Directand Iterative Solution of Large Sparse Systems

We discuss a simple technique that is a very powerful tool

64 PP06 Abstracts

for preprocessing large sparse matrices. This technique ob-tains a permutation that maximizes the modulus of theproduct of entries on the diagonal of the permuted matrix.A consequent scaling ensures that the permuted diagonal isall one and all off-diagonal entries have modulus less thanor equal to one. Since the MC64 code for doing this wasincluded in HSL in 1999, many people have used the codein a wide range of applications some of which we discussin this talk.

Iain DuffRutherford-Appleton Laboratory and [email protected]

MS21

A Parallel Version of GMRES Preconditioned bythe Multiplicative Schwarz Method

Domain decomposition provides a class of divide-and-conquer methods suitable for the solution of linear or non-linear systems of equations arising from the discretizationof partial differential equations. For linear systems, do-main decomposition methods can be viewed as precon-ditioners for Krylov subspace techniques. Traditionally,there are two classes of iterative methods that derive fromdomain decomposition with overlapping: Additive Schwarzand Multiplicative Schwarz. The former can easily be par-allelized but the latter is usually a better preconditioner.We present a parallel version of the combination of theGMRES method and of the Multiplicative Schwarz precon-ditioner. The algorithm is based on a parallel constructionof a non orthogonal basis of the Krylov subspace. Par-allelism is obtained by pipelining two recursions, namelythe Multiplicative Schwarz recursion and the Krylov basisconstruction : it occurs through a front-wave of these tworecursions. Orthogonalization of the basis is performed bya fully parallel algorithm. One of the main goal of theimplementation is to avoid almost all the scalar productsof the vectors distributed over the processors in order toforbid global communications and therefore to reach a bet-ter scalability. The developed code is based on PETSc andMPI standards. In order to obtain a full chain operating asa black box linear system solver, an automatic partitioneris also under development.

Guy Antoine Atenekeng-KahouUniversity of Yaounde I and INRIA/IRISAguy.atenekeng [email protected]

Bernard PhilippeIRISA-INRIARennes [email protected]

Emmanuel KAMGNIAUniversity of Yaounde [email protected]

Roger Blaise SidjeUniversity of [email protected]

MS21

Preconditioning with Non-Symmetric Permuta-tions

This talk will discuss a preconditioning method based oncombining two-sided permutations with a multilevel ap-proach. The nonsymmetric permutation exploits a greedy

strategy to put large entries of the matrix in the diago-nal of the upper leading submatrix. The method can beregarded as a complete pivoting version of the incompleteLU factorization. This leads to an effective incomplete fac-torization preconditioner for general nonsymmetric, irreg-ularly structured, sparse linear systems. The algorithm isimplemented in a multilevel fashion and borrows from theAlgebraic Recursive Multilevel Solver (ARMS) framework.Numerical experiments will be reported on problems thatare known as difficult to solve by iterative methods.

Yousef SaadDepartment of Computer ScienceUniversity of [email protected]

MS21

A Parallel Hybrid Banded System Solver: theSPIKE Algorithm

We describe a robust hybrid parallel solver “the SPIKEalgorithm” for banded linear systems. Two versions ofSPIKE with their built-in-options are provided: (a) Re-cursive SPIKE for handling non-diagonally dominant sys-tems, and (b) Truncated SPIKE for diagonally dominantsystems. These SPIKE schemes can be used either asdirect solvers, or as preconditioners for outer iterativeschemes. Both versions are faster than the direct solverscurrently used in ScaLAPACK on parallel computing plat-forms. The SPIKE scheme is also quite competitive interms of achieved accuracy for handling systems that aredense within the band. Further, we provide effective ver-sions of the SPIKE algorithm for handling systems that aresparse within the band.

Eric PolizziDepartment of Computer SciencePurdue [email protected]

Ahmed SamehDepartment of Computer SciencePurdue [email protected]

MS22

Query-Driven Visualization

As datasets have grown larger and more complex, many re-search projects have focused on techniques for ”scalable vi-sualization.” While these technologies are more capable ofprocessing larger datasets, they don’t necessarily improvescientific insight. Query-driven visualization is a differentapproach that focuses visualization and analysis processingonly on data deemed to be ”interesting.” Our approach toquery-driven visualization leverages state of the art index-ing and query technology to extract ”interesting” subsets ofdata that are then used as input to a visualization or anal-ysis processing pipeline. This approach holds the potentialto foster greater scientific insight by focusing processingresources only on ”interesting data.”

Wes BethelLawrence Berkeley National [email protected]

MS22

Approaches for Large-Scale Data Analysis and Vi-

PP06 Abstracts 65

sualization

Placeholder text: Over the years, numerous approacheshave been explored for use in high performance, large scaledata analysis and visualization. These include parallel im-plementations of common visualization algorithms (isosur-facing), parallel rendering algorithms (object order and im-age order) along with techniques to accelerate the underly-ing algorithms themselves based upon qualities of the data.This presentation will present a survey of these techniquesalong with a discussion of ”lessons learned.”

Charles HansenUniversity of [email protected]

MS22

Efficient Streaming Multiresolution Data Modelsfor Visualization and Analysis

Placeholder text. Of the many approaches to high per-formance visualization and analysis, data organization andaccess strategies offer the ability to quickly interrogate, ac-cess, analyze and visualize data on platforms ranging froma lowly laptop to a powerful dedicated parallel visualizationsystem. Our strategy is to exploit the coupling betweentime-critical algorithms and progressive multi-resolutiondata-structures to realize an end-to-end optimized flow ofdata from the original source, such as remote storage orlarge scientific simulation, to the rendering hardware.

Valerio PascucciLawrence Livermore National [email protected]

MS22

Data Analysis Challenges for Terascale SupernovaeModeling

Large scale radiation-hydrodynamic simulations of corecollapse supernovae produce copious amount of data forseveral reasons. First, the required spatial and energyresolution means that the number of of variables in theproblem is quite large. Secondly, this phenomena involvestimescales that require long temporal evolutions which pro-duce a lengthy time series of data. These characteristicsprovide challenges for parallel I/O, data transfer, datamanagement, post-run analysis, and scientific visualiza-tion. In this talk we discuss these challenges in detail anddescribe some of the solutions that we have employed toaddress them.

Doug SwestySUNY at Stony BrookDept. of Physics & [email protected]

MS23

Parallel Methods for Nano/Materials Science Ap-plications

The nano/materials science community is one of the largestusers of computer cycles in computer centers all over theworld. They were one of the first communities to parallelizeand redesign their methods to run efficiently on parallelcomputers. In recent years through the funding of compu-tational projects such as the DOE scidac projects there hasbeen renewed effort in developing new parallel algorithmsfor larger nanoscale applications that scale to larger pro-

cessor counts. The new parallel algorithms developed forthese codes will be discussed as well as new parallel meth-ods for solving the physics equations. Performance of thesemethods/codes will also be discussed on the leading HPCparallel platforms such as the IBM BG/L, Cray X1 andEarth Simulator.

Andrew M. CanningLawrence Berkeley National [email protected]

MS23

Large-Scale First-Principles Molecular Dynamicsfor Nanoscience

We present recent applications of first-principles molecu-lar dynamics (FPMD) to nanoscience. FPMD simulationsprovide a consistent and simultaneous description of atomictrajectories and electronic structure. Recent advances inthe development of large-scale parallel computers have con-siderably increased the size of systems that can be mod-eled with FPMD. We present examples of applications ofFPMD to nanoscience and discuss the current limits of thisapproach. We also discuss some implementation challengeson large parallel computers, including the BlueGene/L su-percomputer.

Francois GygiUniversity of California, DavisLawrence Livermore National [email protected]

MS23

Direct Parallel Quantum Simulations of the Prop-erties of Magnetic

It is becoming increasingly clear that, with continued the-oretical and algorithmic development and more powerfulcomputers, direct first-principles simulation of the prop-erties - magnetic moments, magnetic anisotropy energies,thermal stability, and switching modes and times of mag-netic nanostructures can play a key role in developmentof future magnetic storage, magnet sensor, and perma-nent magnet devices. Here I will outline the theory, al-gorithms, and our first principle multiple scattering meth-ods for simulating the magnetic properties of surface, bulk,and isolated magnetic nanostructures on parallel comput-ers. To illustrate what can now be achieved I will showresults for some magnetic nanostructures that are of di-rect experimental interest Fe and Co nanowires at Pt-surface step-edges, as well as Fe and FePt-nanoparticlescontaining many thousands of atoms. Work supportedby the DOE-OS through the Offices of Basic Energy Sci-ences (BES),Division of Materials Sciences and Engineer-ing. Calculations were performed at the National En-ergy Research Supercomputer Center, the Pittsburg Su-percomputer Center, and the ORNL-Center for Compu-tational Sciences. Oak Ridge National Laboratory ismanaged by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725. Work performed in collaboration withYang Wang (PSC), Aurelian Rusanu and J. S. Faulkner(Florida Atlantic University), Markus Eisenbach, DonNicholson (ORNL), Balazs Ujfalussy, Lazlo Szunyogh ,Bence Lazarovits (Budapest), Balazs Gyorffy (Bristol) andPeter Weinberger (Vienna)

Malcolm StocksTheory Group, M&C DivisionOak Ridge National Laboratory

66 PP06 Abstracts

[email protected]

MS23

Accelerating Parallel Electronic NanostructureCalculations Through Bulk Band Preconditioning

The need for efficient iterative parallel eigensolvers in thefield of computational nanotechnology is well recognized.This is because most electronic nanostructure calculationslead to discrete eigenvalue problems of very large size. Anobstacle to the efficiency of the iterative solvers is thattheir convergence depends on the condition number of thesystem at hand. We have developed a preconditioningtechnique, called bulk band (BB preconditioner), that al-leviates this problem and gives a significant computationalspeed improvement compared to the best currently usedpreconditioner. The basic idea behind BB is to use theelectronic properties of the bulk materials constituent forthe nanostructure in designing a preconditioner. The bulkband states can be computed inexpensively from 2 atomunit cells. Since the nanostructure eigenstates are consis-tent with these bulk band states, the bulk band can beused to construct a preconditioner which would be moreefficient than using the plane-wave basis.

Stanimire TomovInnovative Computing Laboratory, Computer ScienceDeptUniversity of Tennessee, [email protected]

MS24

Managing Complexity in Parallel Adaptive Multi-phyiscs Applications

Massively parallel processing and adaptive methods are be-ing used by advanced scientific and engineering applica-tions to perform simulations with high fidelity models andmultiple coupled physics. Managing complexity is chal-lenging and critical to the success of such parallel adap-tive multiphysics applications. The Sierra Framework usesmathematical abstractions from set theory to manage thiscomplexity for a family of such applications. This theoryand its application to parallel dynamic distributed meshdata management will be presented.

H. Carter EdwardsSandia National [email protected]

MS24

Techniques for Bandwidth Management in the LociFramework

The Loci framework employs a logic-relational model forthe description continuum mechanics solution algorithms.The framework has been used in the development of a widerange of applications ranging from computational combus-tion to aircraft icing simulations. In this talk we describetechniques that have been used to increase performanceof Loci applications on bandwidth limited architectures.These transformations include segmenting work into cachesized blocks and replicating fast computations to reducebandwidth limited communications.

Edward LukeDept. of Computer Science and EngineeringMississippi State [email protected]

MS24

Efficient Massively Parallel Adaptive Algorithmsfor Time-Dependent Transport

This talk presents the implementation a general infrastruc-ture for the simulation of physical systems whose evolutionsdepend on the transport of subatomic particles coupledwith other complex physics. We will present the discrete-ordinates method and its most time consuming step, theordered traversal of the spatial grid in each direction ofparticle travel. We have implemented the parallel versionof this simulation using STAPL, a parallel version of STLunder development in our group.

Lawrence RauchwergerComputer Science DepartmentTexas A&M [email protected]

MS24

Effective Indexing of Distributed MultidimensionalScientific Datasets

Applications that query into very large distributed multi-dimensional datasets are becoming more common. Thesedatasets are becoming truly enormous, and are often storedacross multiple sites. Therefore, we have been developingindexing algorithms that work well in a distributed com-puting environment, employing both hierarchy and repli-cation to provide scalability and maximize performance ofboth searches and updates. We also describe a completelydecentralized scheme that takes advantage of the relia-bility mechanisms inherent in unstructured peer-to-peer-systems.

Alan SussmanUniversity of MarylandDepartment of Computer [email protected]

MS25

parpcx: A Parallel Solver for Linear Programming

Surprisingly, there appears to be no publicly available LPsolver suitable for solving large problems in parallel ondistributed-memory machines (clusters). We describe thedesign of parpcx, a parallel LP solver based on PCx, whichmay fill this gap. parpcx goes beyond the original PCx inthat it implements more choices of linear algebra methods,allowing experiments and easy comparisons. We presentsome preliminary results with LPs from sensor placement.Also, we discuss how hypergraph partitioning may be usedin LP decomposition methods.

Ojas ParekhEmory [email protected]

Erik G. BomanSandia National Labs, NMDiscrete Math and [email protected]

MS25

Direct Parallel Solution of Linear Systems of Di-mension 109

Linear systems of equations resulting from very large op-

PP06 Abstracts 67

timization problems are usually solved by iterative meth-ods. However this needs careful crafting of an appropriatepreconditioner, and might still fail to reach adaequate ac-curacy. Our object-oriented parallel solver OOPS can usedirect factorizations for these systems by exploiting blocksparsity. We have recently ported the code to massivelyparallel architechture and are now able to solve stochasticQP problems exceeding 109 variables.

Andreas GrotheySchool of MathematicsUniversity of [email protected]

Jacek GondzioThe University of Edinburgh, United [email protected]

MS25

A Convex QP Solver Based on Block-LU Updates

Active-set methods have an advantage over interior meth-ods in permitting warm starts. We describe a convex QPmethod intended for use within SNOPT (a sparse SQPpackage for constrained optimization). An initial KKTsystem is factorized by any available method (such as LU-SOL or PARDISO). Active-set changes are implementedby block-LU updates that retain sparsity while leaving theoriginal KKT factors intact.

Michael SaundersSystems Optimization Laboratory (SOL)Dept of Management Sci and Eng, [email protected]

Hanh HuynhStanford [email protected]

MS25

Combinatorial Approaches to the Solution ofSaddle-Point Problems in Large-Scale ParallelInterior-Point Optimization

Interior-point methods are among the most efficient ap-proaches for solving large-scale nonlinear programmingproblems. At the core of these methods, highly ill-conditioned symmetric saddle-point problems have to besolved. We present combinatorial methods to preprocessthese matrices in order to establish more favorable numeri-cal properties for the subsequent solution step. We demon-strate the competitiveness of this approach within IPOPTon a large set of test problems from the CUTE and COPSsets, as well as large optimal control problems involving 2Dpartial differential equations. The largest nonlinear opti-mization problem solved has more than 12 million variablesand 6 million constraints.

Andreas WachterIBM T.J. Watson Research CenterYorktown Heights, [email protected]

Michael HagemannComputer Science Departmenrt Universitg of [email protected]

Olaf SchenkDepartment of Computer Science, University of Basel

[email protected]

MS26

A Hybrid 2D Method for Sparse Matrix Partition-ing

Two-dimensional matrix partitioning has been shown tobe effective for many irregular applications, including datamining, circuit simulation and electrophoresis. 2D ap-proaches include recursive bipartitioning in the Mondriaanpartitioner, fine-grained partitioning provided by PaToH,and coarse-grained partitioning based on multiconstrainthypergraph partitioning (also in PaToH). We present a hy-brid adaptive approach combining the recursive and fine-grained techniques to reduce communication volume, anddemonstrate its use in non-PDE applications.

Rob H. BisselingDept. MathematicsUtrecht [email protected]

Umit V. CatalyurekDepartment of Biomedical InformaticsThe Ohio State [email protected]

Tristan van LeeuwenUtrecht [email protected]

MS26

Parallel Hypergraph Partitioning for IrregularProblems

Parallel simulations of electrical circuits and inhomoge-neous fluids do not perform well with standard partition-ing techniques. Similarly, many partitioners do not applynaturally to linear programming and network-based sim-ulations. Partitioners that accommodate non-symmetry,heterogeneity, and high connectivity are needed. Hyper-graph models naturally represent such applications andprovide decompositions with lower communication volumeand higher parallel performance. We present parallel hy-pergraph partitioning algorithms and demonstrate theirperformance for circuit and density functional theory ap-plications.

Robert HeaphySandia National [email protected]

Umit V. CatalyurekDepartment of Biomedical InformaticsThe Ohio State [email protected]

Erik G. BomanSandia National Labs, NMDiscrete Math and [email protected]

Karen D. DevineSandia National [email protected]

68 PP06 Abstracts

MS26

Parallel Algorithms for Inhomogeneous FluidsModels using Molecular Density Functional The-ories

Molecular DFT problems typically involve tens to one hun-dred or more unknowns per mesh point, requiring the so-lution of inherently discrete equations, short-to-mediumrange integral equations and PDEs. In this presentation,we describe a new family of algorithms that provides ordersof magnitude improvement in performance and robustnessover previous general-purpose PDE-based approaches. Wehighlight how these problems are unlike PDEs and howwe addressed these differences to achieve robust, scalablesolutions.

Michael A. HerouxSandia National [email protected]


Laura FrinkSandia National [email protected]

MS26

Scalable Electrical Circuit Simulation: Why PDEAlgorithms Don’t Fit

Unlike mesh based PDE simulation, electrical circuit sim-ulation involves networks of nonlinear device models withno basis in spatial operators. This means many assump-tions made in linear and nonlinear algorithms for solutionof PDEs may not be appropriate for this class of problems.Development of the Xyce(TM) simulator has lead to de-velopment of new algorithms in these areas which can sub-stantially improve robustness and massively parallel per-formance for this class of problems.

Michael A. Heroux, David DaySandia National [email protected], [email protected]

Roger PawlowskiSandia National [email protected]

Robert J. Hoekstra, Eric KeiterSandia National [email protected], [email protected]

MS27

Parallelism and Linear Solver Performance inMHD and PIC Simulations of Burning Plasmas


Mark F. AdamsColumbia [email protected]

MS27

Exploring Mixed MPI/OpenMP Parallelizationand Matrix Reordering in implicit MHD Compu-

tations

In many applications,the stiffness of extended-magnetohydrodynamics requiresimplicit time-stepping when applying numerical computa-tion. However, the range of dynamical scales leads to ill-conditioned matrices that have dense blocks resulting fromhigh-order spatial methods. Combining direct methods forapproximate matrices with matrix-free iterative solves hasproven successful, but parallel efficiency is needed for dif-ferent and sometimes conflicting aspects of the algorithm.Efforts combining OpenMP and MPI parallelization andapplying sparse-matrix reordering using nested dissectionare reported.

Christopher Carey, Carl R. SovinecUniversity of [email protected], [email protected]

MS27

Numerical Challenges in Extended Magnetohydro-dynamics Modeling


Jin ChenPrinceton Plasma Physics [email protected]

MS27

Performance and Experiences on Porting XGC tothe Cray X1E and XT3


Eduardo F. D’AzevedoOak Ridge National LaboratoryMathematical Sciences [email protected]

MS28

DUNE – The Distributed and Unified NumericsEnvironment

The main aspect of this talk is the DUNE Parallel Grid In-terface. Using C++ techniques DUNE allows to use verydifferent grid implementation under a common interfacewith a very low overhead. We will give an overview over thetechinques used and present numerical simulations doneon different grids using the same discretization. The abili-ties of these grids range from sequential structured grid toparallel adaptive grids with load balance and will includeconforming as well as non conforming grids.

Peter BastionUniversity of Heidelberg, [email protected]

Christian EngweUniversity of [email protected]

MS28

Efficient Data Management in Parallel Finite El-ement Applications through Object-Oriented De-sign

Often times the bookkeeping in parallel unstructured finite

PP06 Abstracts 69

element applications becomes cumbersome. In order to al-leviate this burden from the application developer we havecreated the Simple Parallel Object-Oriented ComputingEnvironment for the Finite Element Method (SPOOCE-FEM). The SPOOCEFEM library routines manage the in-terface between global and local data, mesh partitioning,and synchronizing data. This frees the developer to concen-trate on application development without the bookkeepingheadache or parallel performance concerns.

Dale ShiresComputer Scientist, High Performance ComputingDivisionU. S. Army Research [email protected]

Brian HenzUS Army Research [email protected]

MS28

Generic Interface for Parallel Mesh Infrastructure

Generic programming enables to develop generalized soft-ware components that are easily reusable and adaptablein a wide variety of situations. In the development of aSCOREC parallel mesh infrastructure, the FMDB, the de-sign of generic interface for parallel mesh database includ-ing the mesh migration and dynamic mesh load balancingprocedure which enables the mesh infrastructure reusableand easily refinable as possible with various needs of par-allel mesh-based applications is presented.

E. Seegyoung SeolSCOREC, [email protected]

Mark S. ShephardRensselaer Polytechnic InstituteScientific Computation Research [email protected]

MS28

Scalable Parallel Unstructured Meshing in Par-FUM

Unstructured meshes are used in many computationallyintensive engineering applications. The dynamic behav-ior of such applications makes the development of scal-able parallel software challenging. The Charm++ Paral-lel Framework for Unstructured Meshes (ParFUM) allowsapplications to use parallel unstructured meshes with min-imal parallel computing knowledge, and scale to hundredsof processors. Charm++’s message-driven model enablesadaptive communication/computation overlap, and incor-porates dynamic load balancing. ParFUM includes ad-vanced meshing capabilities, such as parallel mesh adap-tivity and collision detection.

Terry WilmarthComputational Science and EngineeringUniversity of Illinois at [email protected]

Laxmikant KaleUniversity of Illinois at [email protected]

MS29

Asynchronous Parallel Pattern Search in the Con-text of a Globally Convergent Augmented La-grangian Method

Kolda, Lewis, and Torczon have proposed an augmentedLagrangian algorithm that uses a linearly-constrained pat-tern search method to solve the subproblem. This algo-rithm is globally convergent but does not require deriva-tives for the objective or the nonlinear constraints. In thistalk, we discuss an asynchronous parallel implementationusing APPSPACK. Parallelism is important for real-lifeproblems where the function and constraint evaluation areoften computationally expensive. We will describe the par-allel algorithm and present numerical results.

Josh GriffinSandia National [email protected]

MS29

An Introduction to Parallel Optimization Methods


Juan MezaLawrence Berkeley National [email protected]

MS29

Using Task Priorities to Design Dynamic, Dis-tributed Numerical Optimization Algorithms forNon-Dedicated, Heterogeneous Computing Envi-ronments

Many algorithms for solving numerical optimization prob-lems proceed by dynamically generating, and modifying, apool of tasks used to direct the search. Such algorithmsare attractive candidates for distributed computing sincethey generate a large number of computationally intensivetasks. The challenge is that the dynamic nature of thetask pool makes it difficult, a priori, to devise either exter-nal scheduling schemes or internal load balancing schemesthat are effective. We discuss a strategy to determine taskplacement by combining forecasts about the system withpriorities assigned to tasks. We give a simple example ofsuch an algorithm and report some preliminary results.

Virginia J. TorczonCollege of William & MaryDepartment of Computer [email protected]

Anne ShepherdCollege of William and [email protected]

MS29

Parallel Interior-Point Methods Newline for LargeSemi-Definite Programming

Semi-Definite Programming (SDP) is an optimization witha linear objective function over positive semidefinite con-straints of symmetric matrices. SDP solvers provide ef-ficient methods to solve Linear Matrix Inequality andto compute the ground-state energy in quantum chem-istry. To solve extremely large SDPs, the parallel SDPsolver (SDPARA) mainly reduces the computation time of

70 PP06 Abstracts

the Schur complement matrix in primal-dual interior-pointmethods owing to parallel computation power. We alsointroduce SDPA online solver which enables users withoutparallel computation environment to solve SDPs via theInternet.

Masakazu Kojima, Mituhiro FukudaTokyo Institute of TechnologyDept of Math & Comp [email protected], [email protected]

Katsuki FujisawaTokyo Denki [email protected]

Kazuhide NakataTokyo Institute of [email protected]

Makoto YamashitaKanagawa UniversityDepartment of Industrial Management and [email protected]

MS30

Parallel Multilevel Preconditioners for Simulationson Adaptively Refined Meshes

Adaptive mesh refinement (AMR) is important for prob-lems where the gradient of the solution varies significantlyin space and time. AMR offers accurate solution againstmodest cost since we use high resolution only where neces-sary. However, adaptive mesh refinement poses a numberof challenges for (parallel) sparse iterative solvers, in par-ticular for the preconditioner. While frequent local changesin the mesh and load balancing make most precondition-ers impractical, sparse approximate inverse preconditioners(SPAI) remain efficient. They are easily updated for localchanges in the mesh and the cost of multiplying by the pre-conditioner does not change much. However, SPAI needs alarge number of nonzero coefficients per column to addressthe global coupling of the equations and reduce the numberof iterations sufficiently. Unfortunately, this tends to makethe preconditioner expensive. Therefore, we are develop-ing multilevel SPAI preconditioners that exploit the AMRstructure to provide global coupling at very low cost.

Shun WangUniversity of Illinois at [email protected]

Eric De SturlerUniversity of Illinois at Urbana-ChampaignDepartment of Computer [email protected]

MS30

Parallel Multilevel Iterative Linear Solvers for Het-erogeneous Field with Adaptive Mesh Refinement

Multilevel method is a promising scalable method for large-scale scientific simulations. Although Gauss-Seidel methodis a typical way for smoothing at each level, some pre-conditioners such as ILU(0) and SPAI can be applied assmoothers. We study various types of smoothers for ourmultilevel method applied to ill-conditioned groundwaterproblems with heterogeneous field properties and adaptivemesh refinement. We show the robustness and scalability

of these smoothers, including comparison with the single-level method.

Kengo NakajimaThe University of TokyoDepartment of Earth & Planetary [email protected]

MS30

On Preconditioners for Complex Symmetric Sys-tems of Linear Equations in a Parallel Eigensolver

We consider efficiency and effectiveness of precondition-ers for complex symmetric linear equations that arise fromsparse eigenvalue problems of large scale molecular orbitalcomputation. We implement an incomplete factorizationpreconditioner with diagonal shifting only for imaginarypart, and a complex version of sparse approximate inversepreconditioner. We report the performance of our meth-ods applied to our master-worker type parallel eigensolver,which runs on multiple PC clusters connected through ahybrid of MPI and GridRPC system.

Hiroto Tadano, Masayuki OkadaGraduate School of Systems and Information ScienceUniversity of [email protected],[email protected]

Tetsuya SakuraiDepartment of Computer ScienceUniversity of [email protected]

Umpei NagashimaResearch Institute of Computational ScienceNational Institute of Advanced Industrial Science [email protected]

MS30

Parallel Incomplete Cholesky Preconditioner withSelective Sparse Approximate Inversion

We consider drop-threshold incomplete Cholesky (ICT)preconditioners for the Conjugate Gradient method ondistributed memory multiprocessors. Our preconditionercombines sparse factorization with selective use of sparseapproximate inversion. The latter is used in certain dis-tributed supernodal submatrices to enable latency tolerantpreconditioner construction and application. We presentresults on the performance of our preconditioner on largesparse linear systems from a variety of practical applica-tions.

Padma RaghavanThe Pennsylvania State Univ.Dept of Computer Science [email protected]

Keita TeranishiDepartment of Computer Science and EngineeringThe Pennsylvania State [email protected]

MS31

A Model for Dynamic Load Balancing on Hetero-

PP06 Abstracts 71

geneous and Non-dedicated Clusters

As clusters become increasingly popular alternatives tocustom-built parallel computers, they expose a growingheterogeneity in processing and communication capabili-ties. Performing an effective load balancing on such en-vironments can be best achieved when the heterogeneityfactor is quantified and appropriately fed into the load bal-ancing routines.

In this work, we discuss an approach based on constructinga tree model that encapsulates the topology and the capa-bilities of the cluster. Thedifferent components of the exe-cution environment are dynamically monitored and theirprocessing and communication capabilities are collectedand then used to assign partition sizes in load balancing.

We used the model, called DRUM, to guide load balancingin the adaptive solution of threenumerical problems. Theresults showed a clear benefit from using DRUM, in clus-ters with computational or network heterogeneity as wellas in non-dedicated clusters. We also showed that DRUMhas a total overhead of less than 0.2% of total executiontime.

Joseph E. FlahertyDept. Computer ScienceRensslaer [email protected]

Jamal FaikOracle USA [email protected]

Jim TerescoDepartment of Computer ScienceWilliams [email protected]

MS31

Parallel Thermal Computation Model of a PEMFuel Cell

In this work, the thermal management problem of protonexchange membrane fuel cell (PEMFC) is addressed. Fuelcells are electrochemical devices which efficiently convertchemical energy into DC electricity and some heat (ther-mal energy). PEMFCs are the most favoured fuel cell typeto power electric vehicles of the future as a replacement forthe internal combustion engine. However, there are manypractical problems to be overcome to make this technol-ogy economically competitive. Thermal management andminimizing temperature constraints, which add complex-ity and cost to the system, is one of the most importantissues. To understand the phenomenon, thermal computa-tion models of a PEM fuel cell that have been proposed inthe literature involve complex systems of partial differen-tial equations and their resolution can not be done withina reasonable amount of time. Recently, a new model basedon the nodal network simulation approach was proposed.In this model, the system is divided into elementary vol-umes that are dependent of each other according to con-ductance conditions. In this paper, a parallel algorithm tosolve this thermal computation model is proposed togetherwith parallel implementations using PVM and MPI.

Salah Abdelkrim, Abdellah El Moudni, Jaafar GaberLaboratoire SeTUniversite de Technologie de Belfort [email protected], [email protected],

[email protected]

Rachid OutbibLaboratoire L2ESUniversite de Technologie de Belfort [email protected]

MS31

Cluster Computing for a System of Time-Dependent Reaction-Diffusion Equations on aThree-Dimensional Domain with High Resolution

The flow of calcium ions in one human heart cell can bemodeled by a system of time-dependent reaction-diffusionequations. The flow is driven by sources at large numbersof discrete points in the three-dimensional domain. The useof parallel clusters allows for the resolution of these points.We present a special-purpose code that exhibits excellentscalability up to at least 32 processors on a distributed-memory cluster with high-performance interconnect.

Matthias K. GobbertUniversity of Maryland, Baltimore CountyDepartment of Mathematics and [email protected]

MS31

Partitioning Algorithms for Parallel Applicationson Heterogeneous Architectures

Existing partitioning algorithms provide limited supportfor load balancing simulations that are performed on het-erogeneous parallel computing platforms. On such archi-tectures, effective load balancing can only be achieved ifthe graph is distributed so that it properly takes into ac-count the available resources (CPU speed, network band-width). We developed a graph partitioning algorithm thatcan address the partitioning requirements of the scientificcomputations, and can correctly model the architecturalcharacteristics of emerging hardware platforms.

George KarypisUniversity of Minnesota / [email protected]

Irene MoulitsasUniversity of Minnesota and University of [email protected]

MS32

Large-Scale First-Principles Molecular DynamicsSimulations on the BlueGene/L Computer

We present the results of large-scale First-Principles Molec-ular Dynamics (FPMD) applications performed on theBlueGene/L supercomputer using a plane-wave, pseudopo-tential approach. Simulations were run on up to 65,536nodes, using up to 131,072 processors, demonstrating thatan efficient parallelization of plane-wave based FPMD ispossible. Details of the algorithms used and their paral-lelization strategy will be discussed.

Francois GygiUniversity of California, DavisLawrence Livermore National [email protected]

72 PP06 Abstracts

MS32

Large-Scale Parallel Molecular Dynamics Simula-tions of Biomolecules

Scaling simulations of biomolecules is challenging: The sys-tems of interests have a fixed size. Typical simulationsinvolve 10k to a few 100k atoms. Parallelizing a smallamount of computation of each timestep on a large num-ber of processors is thus challenging. This talk will reviewtechniques used in NAMD to address this fine-grain paral-lelization issue, and present performance where 92k atomsimulations are parallelized over several thousand proces-sors, taking only a few milliseconds per step.

Gengbin Zheng, Eric Bohm, Laxmikant KaleUniversity of Illinois at [email protected], [email protected], [email protected]

Jim PhillipsBeckman InstituteUniversity of Illinois at [email protected]

MS32

All Electron Calculations of Biomolecules Usingthe Fragment Molecular Orbital (FMO) Methodand Massively Parallel Computers

The FMO method enables one to calculate electronic struc-tures of very large molecules. In the method, a system isdivided into small fragments and ab initio MO calcula-tions are performed on the fragments and their dimers toobtain the energy and the properties of the entire system.The FMO method was applied to the electronic structurecalculation of photosynthetic reaction center complex ofRhodopseudomonas viridis. The protein complex contains20,581 atoms and 77,754 electrons. The single point en-ergy calculation at the FMO-RHF/6-31G* level took 3days using 600 2.0GHz Opetron CPUs.

Kazuo KitauraNational Institute of Advanced Industrial Scienceand Technology (AIST), [email protected]

MS32

Multibillion-Atom Chemically Reactive and Non-reactive Molecular Dynamics Simulations of Mate-rials

We present our linear-scaling embedded divide-and-conquer (EDC) algorithms: fast reactive force-field (F-ReaxFF) molecular dynamics (MD) and density functionaltheory (DFT) on adaptive multigrids for quantum me-chanical MD. We parallelized these algorithms based ona tunable hierarchical cellular decomposition framework.On 1920 Itanium2 processors, we achieved 0.56 billion-atom F-ReaxFF and 1.4 million-atom (0.12 trillion gridpoints) EDC-DFT, in addition to 18.9 billion-atom space-time multiresolution MD, with parallel efficiency as highas 0.953.

Rajiv K. Kalia, Aiichiro Nakano, Priya VashishtaUniversity of Southern [email protected], [email protected], [email protected]

MS33

Nonlinear Solvers in Reduced Resistive Magneto-

hydrodynamics


David KeyesColumbia UniversityAPAM [email protected]

MS33

One and Two-Level Schwarz Methods in ReducedResistive Magnetohydrodynamics.

We discuss a parallel fully implicit Newton-Krylov-Schwarzalgorithm for solving a magnetohydrodynamics problemin two-dimensional space. Current density sheets becomenearly singular in the process of magnetic reconnection.This behavior of the solution limits time step sizes in ex-plicit schemes. Our approach is a fully implicit time in-tegration using Newton-Krylov techniques with one- andtwo-level additive Schwarz preconditioning. We study par-allel convergence of the implicit algorithm as implementedand run on an IBM BG/L supercomputer.


Florin DobrianColumbia [email protected]

Serguei OvtchinnikovUniversity of Colorado at BoulderDepartment of Computer [email protected]

David KeyesColumbia UniversityAPAM [email protected]

MS33

Parallel Issues in Reduced Resistive Magnetohy-drodynamics Using Implicit AMR


Michael [email protected]

MS33

Adaptive Mesh Simulations of Tokamak Refueling


Ravi SamtaneyPrinceton Plasma Physics [email protected]

MS34

Finite Element Electronic Structure Code Based

PP06 Abstracts 73

on Existing Parallel AMR


Jean-Luc FattebertLawrence Livermore National [email protected]

MS34

Efficiency of Plane-Wave Density Functional Cal-culations on the Cray X1E

Plane-wave density functional calculations are a broadlyapplicable workhorse method in nanoscience, materials sci-ence, and condensed matter physics, owing to their ac-curacy and applicability to large systems. Computation-ally, execution time is dominated by linear algebra andthe evaluation of Fourier transforms. Parallel efficiencyis primarily governed by the efficiency of parallel Fouriertransforms, over all or a subset of processors. These meth-ods traditionally achieve a high fraction of peak perfor-mance on scalar architectures. Here, we discuss the adap-tation and efficiency of common plane wave packages tothe vector-based Cray X1E, as well as the introduction ofan optimized Fourier Transform kernel that can be readilyintegrated in plane wave density functional theory codes.Work supported by the Laboratory Directed Research andDevelopment Program of Oak Ridge National Laboratoryas well as the Scientific User Facilities Division, U. S. De-partment of Energy. Work performed in collaboration withT.C Schulthess, M Fahey (ORNL), J. Levesque, N Troullier(Cray).

Paul KentUniversity of Cincinnati, Oak Ridge National [email protected]

MS34

Diagonalization Methods in Real Space DensityFunctional Theory

Density Functional Theory (DFT) is a successful tech-nique used to determine the electronic structure of mat-ter which is based on a number of approximations. Itconverts the original n-particle problem into an effectiveone-electron system, resulting in a coupled one-electronSchrodinger equation and a Poisson’s equation. This cou-pling is nonlinear and rather complex. It involves a chargedensity ρ which can be computed from the wavefunctionsψi, for all occupied states. However, the wavefunctions ψi

are the solution of the eigenvalue problem resulting fromSchrodinger’s equation whose coefficients depend nonlin-early on the charge density. This gives rise to a non-lineareigenvalue problem which is solved iteratively. The chal-lenge comes from the large number of eigenfunctions to becomputed for realistic systems with, say, hundreds or thou-sands of electrons. We will discuss a parallel implementa-tion a finite difference approach for this problem with anemphasis on diagonalization. We will illustrate the tech-niques with our in-house code, called PARSEC. This codehas evolved over more than a decade as features were pro-gressively added and the diagonalization routine, which ac-counts for the biggest part of a typical execution time, wasupgraded several times. We found that it is importantto consider the problem as one of computing an invari-ant subspace in the non-linear context of the Kohn-Shamequations. This viewpoint leads to considerable savingsas it de-emphasizes the accurate computation of individ-ual eigenvectors and focuses instead on the subspace which

they span.

Yousef SaadDepartment of Computer ScienceUniversity of [email protected]

MS34

Eigensolvers for Large Electronic Structure Calcu-lations

The solution of the single particle Schroedinger equationthat arises in electronic structure calculations often re-quires solving for interior eigenstates of a large Hamilto-nian. The states at the top of the valence band and atthe bottom of the conduction band determine the bandgap that relates to important physical characteristics suchas optical or transport properties. In order to avoid theexplicit computation of all eigenstates, a folded spectrummethod is employed to only compute the eigenstates nearthe band gap. In this talk, we compare the folded spectrummethod with other techniques, such as conjugate gradientminimization, locally optimal block preconditioned conju-gate gradient (LOBPCG) and Jacobi-Davidson.

Osni A. MarquesLawrence Berkeley National LaboratoryBerkeley, [email protected]

Christof VoemelLawrence Berkeley National LabScientific Computing Group, Computational [email protected]

MS35

A Distributed Simulation Framework for ScalableLocal-Adaptive Multigrid

Numerical simulation using the parallel-adaptive comput-ing paradigm is a challenging task, especially if distributedmemory machines and unstructured meshes are addressed.We present a simulation platform that provides all capabil-ities to perform highly scalable, large-scale computationsunder a local-adaptive solution strategy with an optimalcomplexity multigrid solver as kernel. Several aspects willbe discussed in more detail, e.g. dynamic load balancingand migration. For real-world examples we show the ma-turity of the developed concepts and their realization.

Stefan LangIWR,Uni [email protected]

MS35

Composition of Components in Multiphysics Ap-plications

Complex multiphysics applications often need to overcomeseveral challenges to achieve accurate and efficient com-position of simulation components, including compatibil-ity between data structures, programming languages andparallel decompositions. We describe mechanisms that en-able composition by employing a component-based archi-tecture where parallelism and communication concerns aredelegated to a separate component. This system enables

74 PP06 Abstracts

complex multiphysics applications that involve wide rangesof length and time scales through transitions in solutionalgorithms, adaptive mesh-refinement, and scalable paral-lelism.

Steven G. ParkerScientific Computing and Imaging InstituteUniversity of [email protected]

MS35

Parallel Data Management and Communication inthe SAMRAI Structured AMR library

Adaptive mesh refinement (AMR) is becoming an increas-ingly important simulation methodology for many scienceand engineering applications. Use of AMR introducesmany complexities on parallel systems, both in program-ming and achieving efficient performance. We describethe data management and communication infrastructureused in SAMRAI. In particular, we will discuss softwaredesigns that facilitate communication different datatypeson complex mesh configurations, load balancing problemsthat have non-uniform workloads, and techniques to reduceadaptive grid overheads on large-scale parallel computersystems.

Rich HornungLawrence Livermore National [email protected]

Andrew M. WissinkCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

MS36

Parallel Delaunay with 10 Billion Tetrahedrons


Guy BlellochCarnegie Mellon [email protected]

MS36

Parallel Delaunay Mesh Generation and Refine-ment using COTS

Delaunay refinement is a popular method for generatingguaranteed quality triangular and tetrahedral meshes. Itis based on inserting additional (Steiner) points to meet theelement quality requirements. We address the fundamentaltheoretical problem of resolving the dependencies amongthe candidate Steiner points as well as the basic practicalissue of using the available high quality sequential softwarein the parallel setting. The goal of this talk is to proposea clear theoretical foundation for developing practical par-allel Delaunay refinement algorithms. First, we state thealgorithm correctness requirements. Any parallel point in-sertion procedure should preserve the Delaunay refinementloop invariant, i.e. the mesh has to be conformal (simpli-cial) and has to satisfy the Delaunay criterion. We showunder which circumstances these requirements can be vio-lated. Second, we propose a new definition of Steiner pointindependence in the context of the parallel Delaunay refine-ment. We say that two points are Delaunay-independentif and only if they can be inserted into the mesh concur-

rently without violating the correctness of the algorithm.Third, we formulate and prove a criterion and a sufficientcondition of Delaunay-independence. Finally, we proposea parallel Delaunay refinement algorithm which does notrequire the restructuring of the sequential meshing kernel,and thus allows to reuse the available sequential commer-cial off-the-shelf software (COTS).

Andrey ChernikovCollege of William and [email protected]

MS36

Parallel Hex Mesh Generation: An Overview


Tim TautgesSandia National [email protected]

MS36

Phoenix: A Computational Database System forGenerating Massive Tetrahedral Meshes

High-performance parallel computers have created un-precedented new opportunities for scientists and engineersto simulate complex physical phenomena on a larger scaleand at a higher resolution than heretofore possible. Un-structured meshes have, accordingly, grown larger andlarger. The scalability of a mesh generator hence often im-poses a limit on the size of the problem that can be solved.We propose a database method to achieve scalability. Ourkey idea is to build a new software system, called a com-putational database system, which maps mesh geometriesto existing and new database structures, and execute meshgeneration algorithms, such as Delaunay meshing, througha sequence of highly-optimized operations on the underly-ing database structures. This new approach allows us togenerate massive tetrahedral meshes on workstations withlimited main memory in a reasonable amount of time. Be-sides, a mesh thus generated is organized and indexed ina database structure that is capable of supporting efficientqueries for many other purposes.

Tiankai TuCarnegie Mellon [email protected]

MS37

An Overview of Support Preconditioning

Support preconditioning is a powerful but not well knownclass of preconditioning techniques for sparse, SPD linearsystems. In contrast to empirical techniques, support the-ory provides bounds on the generalized eigenvalues (con-dition numbers). Many support preconditioners are basedon graph algorithms, and the field is rich in combinatorialproblems. We give a brief overview of different types ofsupport preconditioners. Also, we describe and comparegraph-based preconditioners that have been independentlydeveloped for network optimization. Finally, we discussopen problems and some parallel computing issues.

Bruce HendricksonSandia National [email protected]

Erik G. Boman

PP06 Abstracts 75

Sandia National Labs, NMDiscrete Math and [email protected]

MS37

Efficiently Solving Linear Systems using Hierarchi-cal Decompositions

We focus on algorithms for solving symmetric diagonally-dominant linear systems. It is known that the solutionof these systems can be reduced to Laplacians of graphs.Fifteen years ago, Vaidya showed that one could carefullychoose a subgraph of any given graph and use it as a pre-conditioner with non-trivial convergence guarantees. Thisthread of work culminated recently in the results of Spiel-man and Teng, who gave almost optimal algorithms. Ina parallel thread of work, Gremban and Miller proposedthe construction of preconditioners that may not be sub-graphs of the original graph. They called these precondi-tioners support trees. The construction and analysis ofsupport trees presented additional difficulties that werelargely overcome by Maggs et al., who found a key connec-tion with hierarchical decompositions of graphs proposedby Racke in his seminal 2002 paper. Maggs et al. showedthat Racke’s decomposition tree can be used as a supporttree with convergence guarantees that exceed those of anygiven subtree. We extend this line of work by develop-ing tools for analyzing support graphs. We observe thatRacke’s decomposition has a collection of properties thatallow the design of a sequence of support graphs whichcan be used in recursive iterative algorithms whose conver-gence guarantees always improve those of Maggs et al., andmatch or slightly improve Spielman’s and Teng’s, at leastfor a wide and practical class of graphs.

Yiannis [email protected]

Bruce [email protected]

R. RaviTepper School of [email protected]

Maverick [email protected]

Gary MillerCarnegie Mellon [email protected]

Ojas ParekhEmory [email protected]

MS37

Finite Element Matrix Approximations and Graph-based Parallel Preconditioners

We consider the problem of constructing support graph

preconditioners for matrices arising from finite element ap-proximation of partial differential equations. It is knownthat the standard support graph preconditioning approachis not applicable to finite element stiffness matrices. Wehave developed a technique that uses element-level trans-formations to convert the coefficient matrix into one thatis amenable to preconditioning by support graphs. Ourtalk will outline these transformations for typical finite el-ement bases used for diffusion-type problems and discussthe resulting effect on the quality of the support graphs.We will describe approaches to construct support graphsthat can be used as parallel preconditioners and presentexperimental results.

Vivek SarinTexas A&M UniversityDepartment of Computer [email protected]

MS37

Algebraic Support Approximations for Finite Ele-ment Matrices

I will describe two techniques for algebraically approxi-mating finite-element matrices using support-theory tools.The first technique, called diagonally-dominant elementapproximation, is applicable to matrices arising from scalarelliptic problems. It is similar to recent constructions byBoman-Hendrickson-Vavasis and by Sarin, except that itdoes not require a particular factorization of the elementmatrices. The second technique, called fretsaw precondi-tioning, is applicable to a wider class of problems, includinglinear elasticity. It is also purely algebraic.

Sivan A. ToledoTel Aviv UniversitySchool of Computer [email protected]

Doron ChenTel-Aviv [email protected]

Gil ShklarskiSchool of Computer ScienceTel-Aviv [email protected]

MS38

An Architecture for Reconfiguring Iterative MPIApplications in Dynamic and Heterogeneous Envi-ronments

With the proliferation of large scale dynamic executionenvironments such as grids, the need for providing effi-cient and scalable application adaptation strategies for longrunning parallel and distributed applications has emerged.Message passing interfaces have been initially designedwith a traditional machine model in mind, which assumeshomogeneous and static environments. It is inevitable thatlong running message-passing applications will require sup-port for dynamic reconfiguration to maintain high per-formance under varying load conditions. We describe aframework that provides iterative MPI applications withreconfiguration capabilities. Our approach is based on inte-grating MPI applications with a middleware that supportsprocess checkpointing, process migration, and decentral-ized strategies for application reconfiguration. We presentour architecture for reconfiguring MPI applications and our

76 PP06 Abstracts

experimental evaluation.

Boleslaw Szymanski, Carlos VarelaRensselaer Polytechnic Institutenot submitted, not submitted

Kaoutar El MaghraouiDepartment of Computer ScienceRensselaer Polytechnic [email protected]

MS38

Automated Dynamic Load Balancing Methods forParallel Scientific Computing

Most load balancers only active at a certain predefinedstep. With highly adaptive computations, and changingcomputing environments, this can lead to unbalanced com-putation, as the next load balancing phase may not occurfor some time. A more automated system will allow the ac-tual load of the computing environment to determine whento go into the next load balancing phase. We examine howto create a cost model, and find a heuristic for this opti-mization problem.

Luis GervasioRensselaer Polytechnic InstituteDepartment of Computer [email protected]

Joseph E. FlahertyDept. Computer ScienceRensslaer [email protected]

James D. TerescoDepartment of Computer ScienceWilliams [email protected]

Jamal FaikOracle USA [email protected]

MS38

Decentralized Data Management Framework forDistributed Environments

The tremendous growth in data requirements for scientificapplications stresses the need for new data placement algo-rithms. In this talk we introduce distributed, adaptive, andscalable middleware that provides efficient access to datain distributed environments. At the core of our approachare dynamic data placement and location techniques thatadapt replica creation to the continuously changing net-work connectivity and users behavior. The results showthat our solution provides better data access performancewith lower resource consumption rates.

Boleslaw SzymanskiComputer Science DepartmentRensselaer Polytechnic [email protected]

Houda LamehamediDepartment of Computer ScienceRensselaer Polytechnic [email protected]

MS38



Andrew LumsdaineOpen Systems LaboratoryIndiana [email protected]

MS39

Optimal Algorithm Choice Through StatisticalModeling


Victor EijkhoutUniversity of Texas at AustinTexas Advanced Computer [email protected]

MS39

Easy Evaluation Way of ILU Preconditioning Ef-fects for Automatically Tuned Iterative LinearSolvers

In Incomplete LU (ILU) preconditioning, orderings and alevel of fill-in’s affect the effect of preconditioning. The au-thor recently proposed a simple evaluation way for the ILUpreconditioning effect. The evaluation index, which has asimple relationship with the matrix norm of the remain-der matrix, is easily computed without additional memoryrequirement. The effectiveness of the method is examinedby numerical tests using coefficient matrix data from theMatrix Market and a 3-d electromagnetic field analysis.

Takeshi IwashitaAcademic Center for Computing and Media StudiesKyoto [email protected]

MS39

ABCLibScript: A Directive to Specify Auto-tuningFacility—Its API and The Application Examplesfor Several Numerical Methods

A directive named ABCLibScript to specify auto-tuningprocesses for numerical software is presented. The main ob-jects are performance parameters for loop unrolling depth,cache blocking size, and algorithm selection for dense andsparse kernels. We limit its function to the three specifica-tions: Unroll, Variable, and Select. The API and examplesare also shown. A prototype of the preprocessor for AB-CLibScript and a visualizer to support user’s descriptionwill be demonstrated in this presentation.

Takahiro KatagiriUniversity of California at BerkeleyThe University of [email protected]

Kenji Kise, Hiroki Honda, Toshitsugu YubaThe University of [email protected], [email protected], [email protected]

MS39

Automatic Selection of Preconditioners for Library

PP06 Abstracts 77

of Sparse Iterative Solvers

High performance numerical libraries containing automaticselection are highly recommended. We focus on two itera-tive solvers that use the conjugate gradient (CG) methodand the generalized minimal residual (GMRES) method.They are suitable for solving large sparse matrix equations.Our libraries select an appropriate preconditioning methodfor each type of matrix and target architecture in order toreduce computation time as much as possible. This auto-matic selection is performed at runtime.

Hisayasu KurodaThe University of [email protected]

MS40

Normalized Cuts Without Eigenvectors: A Multi-level Approach

Graph partitioning is widely used in scientific computingand parallel processing. Typically the goal is to partitionthe nodes of a graph so that the graph cut between thepartitions is minimized while constraining the partitionsto be (nearly) equal in size. Recently, graph partitioning(also called graph clustering) has been used in varied dataclustering problems in data mining and machine learning.However, in these applications, there is no inherent need forthe graph partitions to be equal in size. Hence weightedgraph cuts, such as Normalized Cuts, have been used asthe objectives to be minimized. Typically, spectral cluster-ing algorithms that require eigenvector computations areused for this task. In this talk, we first show a theoreticalequivalence between the objective function of minimizingweighted graph cuts, and the objective function used in theweighted kernel k-means clustering formulation. However,kernel k-means is prone to local minima and initializationproblems. To avoid these problems, we have developeda multilevel approach for efficient and robust minimizationof weighted graph cuts that exploits the above equivalence.Experimental results on large data sets underscore the ef-fectiveness of our multilevel approach.

Inderjit S. DhillonDepartment of Computer SciencesUniversity of Texas, [email protected]

MS40

Multilinear Algebra for Link Analysis

Linear algebra is a powerful and proven tool in web search.Typically, techniques score web pages based on the princi-pal eigenvector (or singular vector) of a non-negative ma-trix that captures the hyperlink structure of the web. Wepropose a new methodology, based on a PARAFAC tensordecomposition, that uses multilinear algebra to elicit moreinformation from a higher-order representation of the webgraph. We present numerical results and discuss the com-putational challenges of scaling these techniques.

Brett W. BaderSandia National LaboratoriesComputational Sciences [email protected]

Tamara G. KoldaSandia National [email protected]

MS40

Sampling Methods for Data Applications of Linearand Multilinear Algebra

An emerging paradigm for accessing, computing with, andlearning from large complex data sets involves the princi-pled use of “sampling”. In this approach, one uses only asmall portion of the data, and one performs computationsof interest for the full dataset by using the small portion asa surrogate. We discuss recently-developed random sam-pling algorithms for structure identification and extractionfrom data sets that may be naturally modeled within alinear-algebraic or multi-linear-algebraic framework.

Michael MahoneyYahoo! [email protected]

MS40

Numerical Linear Algebra for Data and Link Anal-ysis

Modern information retrieval and data mining systems op-erate on large datasets and require efficient, robust andscalable algorithms. Numerical linear algebra provides asolid foundation for them. In this talk I will discuss a lin-ear system formulation of Pagerank, a web page rankingalgorithm, and Krylov subspace methods for its efficientsolution. I will describe a scalable parallel implementationof PageRank and present convergence results of numericalexperiments on multiple web graphs.

Leonid ZhukovYahoo! [email protected]

David [email protected]

MS41




MS41

Lfc: Parallel Computing with Sequential and In-teractive Interface

Interactive environments such as Mathematica, Matlab, orMaple offer computational capabilities that are limited tosequential execution. We present a framework that ex-tends these capabilities to parallel execution. Our uniquedesign is independent of the interactive environment. Weenvision a possibility to switch between interactive environ-ments while keeping the data on the parallel server. Weuse Lapack For Clusters package that utilizes ScaLAPACKon the server and can be accessed from Matlab or Pythonclients.

Piotr LuszczekDepartment of Computer ScienceUniversity of Tennessee, [email protected]

78 PP06 Abstracts

MS41

Use of LAPACK and ScaLAPACK in MATLAB

After reviewing how MATLAB 7.0 makes use of LAPACKfor matrix computation on single processor machines, wewill describe how future versions of MATLAB could makeuse of ScaLAPACK in distributed memory, multiprocessorenvironments.

Cleve MolerThe MathWorks, [email protected]

MS41

Parallelizing the MRRR algorithm for ScaLA-PACK

We present PDSYEVR, the ScaLAPACK version of theparallel MRRR algorithm for the symmetric eigenvalueproblem. Depending on progress until the conference, wewill also talk about the parallel MRRR-based SingularValue Decomposition.

Christof VoemelLawrence Berkeley National LabScientific Computing Group, Computational [email protected]

MS42

Distributed Optimization for Complex System De-sign

Design of complex systems governed by computational sim-ulations may require distributed optimization approachesfor reasons of performance or tractability. In this case aconventional, “non-parallel” nonlinear programming prob-lem formulation is possible, in principle, but a distributedformulation is preferable. In other cases, the components ofa complex system are inherently nearly autonomous and re-quire a distributed approach to design and control becausenon-distributed formulations are unsuitable. We investi-gate distributed optimization formulations and attendantalgorithms for both classes of problems.

Natalia AlexandrovNASA Langley Research [email protected]

MS42

Aspects of Optimization in Biomedical Flow Simu-lation

Several challenges and methods in biomedical flow devicedesign are described. The objective often involves theunique behavior of blood as the flowing medium, neces-sitating, e.g., accurate modeling of cell damage. The com-plex constitutive behavior, in particular shear-thinning,may affect the outcome of shape optimization more thanit affects direct flow analysis. Finally, target applicationsoften involve intricate time-varying geometry, and thus, re-alistic solutions can be only obtained on high-performanceparallel computers.

Mike NicolaiRWTH Aachen [email protected]

Marek Behr

RWTH Aachen UniversityChair for Computational Analysis of Technical [email protected]

MS42

Topology Optimization of Fluid Problems based onthe Lattice Boltzmann Method

The geometry of a body submerged in a Navier-Stokes flowis optimized for a given performance functional. A newtopology optimization approach is presented modeling theflow by the Lattice Boltzmann method (LBM). The LBMis a natural match with topology optimization as in bothcases the geometry of a submerged body is defined in an“on-off” mode of computational cells of a fixed regular grid.The LBM eases parallel implementation, features scalabil-ity, and is able to deal with complex flows. The proposeddesign optimization method is verified by numerical studiesof 2D and 3D steady-state problems.

Georg Pingen, Anton Evgrafov, Kurt MauteUniversity of Colorado at [email protected], [email protected],[email protected]

MS42

Parallel Hessian Approximations in the Optimiza-tion of Systems Governed by P.D.E.

We discuss some parallel approaches to approximating theHessian in PDE-constrained optimization. These approxi-mations are based on decompositions that take advantageof locality of effects in space and in length-scale. Thesedecompositions may be changed during the course of exe-cution. We discuss techniques for determining when sucha change might be advantageous, taking into account thetrade-off between acceleration of the optimization and thecost of interprocessor data migration.

Robert M. LewisCollege of William and [email protected]

MS43

Parallel Adaptive Low Mach Number Simulationof Reacting Flows

Numerical simulation of reacting flows with comprehen-sive kinetics is one of the most demanding areas of com-putational fluid dynamics. High-fidelity modeling requiresaccurate fluid mechanics, detailed models for microcompo-nent transport and detailed chemical mechanisms. Spa-tial and temporal requirements based on integration of thecompressible reacting Navier Stokes equations on a uniformmesh lead to prohibitive estimates for the computationalcost. In this talk we describe methodology to reduce theserequirements based on combining a low Mach number for-mulation with a local adaptive mesh refinement algorithm.The low Mach number equations, which are structurallysimilar to the incompressible Navier-Stokes equations, en-able a larger time step while adaptive refinement reducesthe number of degrees of freedom needed to represent thesolution. However, integration of the low Mach numberequations on adaptive grids introduces additional complex-ity into the discretization. We will discuss a variety of algo-rithmic and implementation issues that must be addressedin developing this approach for distributed-memory par-allel architectures. We will present an application of the

PP06 Abstracts 79

resulting methodoloty to the simulation of turbulent pre-mixed flames and provide comparisons between simulationand experiment.

John B. BellCCSELawrence Berkeley [email protected]

MS43

VORTONICS: A Grid-Enabled Software Packagefor the Evolution, Identification, and Visualizationof Navier-Stokes Vortices

VORTONICS is a suite of software components for study-ing vortical motion and reconnection under the incompress-ible Navier-Stokes equations in three dimensions. It in-cludes three separate modules for the Navier-Stokes solver– a multiple-relaxation-time lattice Boltzmann code, an en-tropic lattice Boltzmann code, and a pseudospectral solver.It includes routines for dynamically resizing the computa-tional lattice using Fourier resizing, wherein data on thelattice is Fourier transformed, high-k modes are added ordeleted to increase or decrease the lattice size respectively,and the inverse Fourier transform is taken. It also includesroutines for visualizing vortex cores using thresholding, Qcriterion, Delta criterion, lambda-squared criterion, andmaximal-line-integral definitions.The package is fully par-allel, making use of MPI for interprocessor communication,and it includes routines for arbitrary remapping of dataamongst the processors, so that one may transform frompencil to slab to block decomposition of the data, as de-sired. It also implements a form of computational steering,allowing the user to modify parameters and schedule tasksdynamically. We describe the VORTONICS package withparticular attention to its recent grid implementation usingMPICH-G2 to allow for both task decomposition and geo-graphically distributed domain decomposition of the com-putational lattice.

Bruce Boghosian, Lucas Finn, Christopher KottkeTufts [email protected], [email protected], [email protected]

MS43

Simulation of Blood Flow in Human Arterial Treeon the TeraGrid

Cardiovascular disease accounts for almost fifty percent ofdeaths in the western world. The formation of arterial dis-ease such as atherosclerotic plaques is strongly related tothe blood flow patterns, and is observed to occur prefer-entially in regions of separated and recirculating flow suchas vessel branches and bifurcations. With grid computingtechnology we perform simulations of blood flow in the en-tire human arterial tree through detailed three-dimensionalcomputations at a number of arterial bifurcations, coupledby the wave-like nature of pulse information traveling fromthe heart to arteries that is modeled by a reduced set ofone-dimensional equations. We employ MPICH-G2 andconduct geographically-distributed coupled cross-site sim-ulations at major TeraGrid sites in the US. The algorithmsand computation results will be presented.

Leopold GrinbergBrown [email protected]

Alex YakhotBen Gurion University Negevalex [email protected]

Spencer SherwinImperial College [email protected]

Suchuan Dong, George KarniadakisBrown [email protected], [email protected]

MS43

Compute and I/O Challenges to Large-Scale OceanData Assimilation: Variational and Ensemble-Based Approaches

Data Assimilation (DA) in ocean science applications isa highly compute and I/O intensive process. Among themost powerful modern approaches are variational schemesusing an automatically computed adjoint and sequentialMonte Carlo ensemble-based non-linear (Error SubspaceStatistical Estimation - ESSE) techniques. These types ofmethods are used for example for global scale ocean esti-mation, ocean prediction and parameter estimation. Thesetwo different methods stress modern computer systems inways characteristic of a larger set of DA methods. Wediscuss the significant performance challenges to both re-garding computation and I/O and describe how their dif-ferent needs best fit in a Grid computing environment ofdistributed parallel platforms.

Constantinos EvangelinosMassachusetts Institute of [email protected]

MS44

A Stable Time-Parallel and Coarseless Implicit Al-gorithm for Second-Order Hyperbolic Problems

Recently, it was proved that for ODEs resulting from thesemi-discretization of second-order hyperbolic problems,most published frameworks for time-parallel computationsare inherently unstable. The objective of this paper is topresent an alternative framework for time-parallel compu-tations that retains the desired stability properties of a pre-ferred ODE solver. Superior performance is demonstratedover the parareal and original PITA frameworks for sev-eral linear oscillator problems including one derived fromthe discretization of an F-16 aircraft.

Julien Cortial, Henri Bavestrello, Climene Dastillung,Charbel FarhatStanford [email protected], [email protected], [email protected], [email protected]

MS44

Time Parallel Time Integrators for Nonlinear Prob-lems

Time parallel time integrators received substantial atten-tion over the past years. One reason for this is the arrival ofparallel computers with thousands of processors which leadto saturation of parallelism in space for space-time prob-lems. Another reason is a variant of a time parallel integra-tor called the parareal algorithm, which has been appliedto several problems in industry. I will present an analysis

80 PP06 Abstracts

of the parareal algorithm applied to non-linear problems.

Martin J. GanderUniversity of [email protected]

MS44

Parareal in Time Method: Application to QuantumControl and Multi-Model Simulation

We first present in this talk the application of the pararealalgorithm with the quantum control by laser fields: thecoupling of the monotonic scheme for the optimal controland the parareal algorithm allows for full efficiency of themethodology. Another aspect of the parareal algorithm ispresented by the use of simplified coarse schemes for theserial propagation. An application to chemical reactions isalso given that proves also large speed up.

Yvon MadayUniversite Pierre et Marie [email protected]

MS44

Space-Time Solution of Large-Scale PDE Applica-tions

Software and algorithms are being developed to efficientlyformulate and solve transient PDE problems as ”steady”problems in a space-time domain. In this way, sophisti-cated design and analysis tools for steady problems canbe brought to bear on transient and periodic problems.This new capability is being developed in the Trilinos solverframework, and allows for parallelism over both space andtime domains. Numerical results from a PDE reacting flowapplication will be presented.

Danny DunlavySandia National [email protected]


Eric PhippsSandia National [email protected]

MS45

Continuous Program Optimization

Our research in the context of the DARPA HPCS projectaims at an infrastructure to characterize and understandthe interactions between hardware and software and to af-fect optimizations based on these characterizations. Wehave developed an architecture for Continuous ProgramOptimization (CPO) to assist in, and automate the chal-lenging task of performance tuning a system. CPO utilizesthe data from all hardware and software layers to detect, di-agnose, and eliminate performance problems. We describethe CPO architecture and several CPO agent implementa-tions that we used do demonstrate the capabilities of thisframework.

Calin CascavalIBM Thomas J. Watson Research Center

[email protected]

MS45

Issues in Parallel Middleware Design for Many-Core Chips - Cyclops-64 Experience

In this talk, we present a case study - illustrating the prob-lems facing the parallel middleware design of high-end par-allel computing systems based on a many-core-on-a-chiparchitecture. We discuss several issues including executionmodels with fine-grain multithreading and memory hierar-chy optimization. A design study derived from the experi-ence on IBM Cyclops-64 architecture will be discussed.

Guang GaoComputer Architecture and Parallel Systems LaboratoryDept. of Electrical and Comp. Eng., University [email protected]

MS45

Support for Parallelism in Intel Compilers

Intel Compilers (F95/C++) are used to obtain optimumperformance from programs when run on Intel processors.As Intel processors have increasingly incorporated variouslevels of parallelism for performance, the compilers haveevolved to help programmers exploit this parallelism. Thetalk will describe the many features and their implementa-tion in the compilers to help parallel programming.

Milind GirkarIntel Compiler [email protected]

MS45

Transactional Memory: Architectural Support forPractical Parallel Programming

As chip-multiprocessors replace uniprocessors in all sys-tem types, practical parallel programming is becoming anurgent need. Transactional memory is a new model forshared-memory multiprocessors that relies on user-definedtransactions as the basic unit of parallelism, coherence andconsistency, error atomicity, and optimization. Transac-tional memory simplifies parallel programming by eliminat-ing the need for manual orchestration of parallelism usinglocks. It also provides a flexible environment for dynamicoptimizations and fault-tolerant computing. This talk willintroduce the software and hardware aspects of transac-tional memory techniques developed at Stanford Univer-sity, provide an initial evaluation, and discuss its long-termpotential.

Christos KozyrakisDepartments of Electrical Engineering and ComputerScienceStanford [email protected]

MS45

SmartApps: Adaptive Applications for High Pro-ductivity/High Performance

The performance of current parallel and distributed sys-tems continues to disappoint. This is due partially to theinability of applications, compilers and computer systemsto adapt to their own dynamic changes. It is also due to the

PP06 Abstracts 81

current compartmentalized approach to optimization: ap-plications, compilers, OS and hardware configurations aredesigned and optimized in isolation. No global optimiza-tion is attempted and the needs of each running applica-tion does not constitute the primary optimization consid-eration. To address these problem we propose application-centric computing, or Smart Applications (SmartApps). Inthe SmartApps executable, the compiler embeds most run-time system services, and a optimizing feedback loop thatmonitors the application’s performance and adaptively re-configures the application and the OS/system platform. Atrun-time, after incorporating the code’s input and deter-mining the system’s resources and state, the SmartAppsperforms an instance specific optimization, which is moretractable than a global generic optimization between ap-plication, OS and system. The overriding philosophy ofSmartApps is “measure, compare, and adapt if beneficial.”SmartApps is being developed in the STAPL infrastruc-ture. STAPL (the Standard Template Adaptive ParallelLibrary) is a framework for developing highly-optimizable,adaptable, and portable parallel and distributed applica-tions. It consists of a relatively new and still evolving col-lection of generic parallel algorithms and distributed con-tainers and a run-time system (RTS) through which the ap-plication and compiler interact with the OS and hardware.SmartApps are developed for several important computa-tional science applications including a discrete-ordinatescode simulating subatomic particle transport, a moleculardynamics code and a program using a novel method to sim-ulate protein folding and ligand binding. We use currentASCI shared and distributed machines, networks of work-stations that run under Linux and K42 operating systemsand, as of late, the Blue Gene machine.

Nancy AmatoTexas A & M [email protected]

Lawrence RauchwergerComputer Science DepartmentTexas A&M [email protected]

MS46

GridSolve: A Seamless Bridge Between the Stan-dard Programming Interfaces and Remote Re-sources

GridSolve is a client-server system that enables users tosolve complex scientific problems remotely. The system al-lows users to access both hardware and software resourcesdistributed across a network. GridSolve searches for com-putational resources on a network, chooses the best onesavailable, and using retry for fault-tolerance solves a prob-lem, and returns the answers to the user. A load-balancingpolicy is used by the GridSolve system to ensure good per-formance by enabling the system to use the computationalresources available as efficiently as possible. Our frame-work is based on the premise that distributed computationsinvolve resources, processes, data, and users, and that se-cure yet flexible mechanisms for cooperation and communi-cation between these entities is the key to metacomputinginfrastructures.


MS46

Architecture-Aware Scientific Computing: Im-proving Performance and Power Characteristics

We consider techniques for improving the performance ofscientific computing codes while reducing the energy con-sumed by the hardware. We consider several memory sub-system optimizations and their impact on performance andpower of certain floating-point intensive benchmarks. Ourresults indicate that the right mix of architectural featuresand application tuning can lead to significant reduction inenergy and improved performance.

Konrad MalkowskiPenn State [email protected]

Padma RaghavanThe Pennsylvania State Univ.Dept of Computer Science [email protected]

Mary IrwinThe Pennsylvania State UniversityDepartment of Computer Science and [email protected]

MS46

Interface of Sparse Linear Solver Library Opti-mized for Various Types of Architectures

Hardware-dependent tuning for optimization is an impor-tant issue for extracting power of modern parallel com-puters. Moreover, portability is critical, because users canaccess to variety types of hardware under GRID environ-ment. Sometimes, different algorithm and data structureare required according to individual hardware. In this pre-sentation, features of an infrastructure for efficient develop-ment of optimized and reliable scientific applications, suchas finite-element method are described. Examples for realapplications are also presented.

Kengo NakajimaThe University of TokyoDepartment of Earth & Planetary [email protected]

MS46

EIGADEPT: An Automatic and Application-Based Eigensolver Selection

The automatic eigensolver selection engine, EIGADEPT, iscomposed of an intelligent engine, knowledge base, decisiontree and various numerical libraries. Based on the applica-tion and matrix properties, EIGADEPT will use intelligentengine to consult with knowledge base and decision tree toderive an optimal eigensolver from the available numericallibraries connected to EIGADEPT. User also has the ca-pability to add new library into EIGADEPT or provide hisown algorithm to solve the eigenvalue problem.

Yeliang ZhangUniversity of [email protected]

MS47

Developing Linear Algebra Libraries: Theory and

82 PP06 Abstracts

Practice

The Formal Linear Algebra Methodology Environment(FLAME) project at UT-Austin pursues the mechanicalderivation of dense linear algebra libraries. The ultimategoal is to produce a system that will be able to take amathematical description of a linear algebra operation andfrom this mechanically produce high-performance libraryroutines together with performance and stability analyses.In this talk we report on progress towards this goal andwhat can be learned from the experience. This work is incollaboration with a large number of researchers at UT-Austin and elsewhere. Visit

http://www.cs.utexas.edu/users/flame/

For a complete list of participants.

Paolo BientinesiDepartment of Computer SciencesThe University of Texas at [email protected]

MS47

The Future of LAPACK and ScaLAPACK

We are planning new releases of the widely used LAPACKand ScaLAPACK numerical linear algebra libraries. Basedon an on-going user survey (www.netlib.org/lapack-dev)and research by many people, we are proposing the follow-ing improvements:

• Faster algorithms (including better numerical meth-ods, memory hierarchy optimizations, parallelism,and automatic performance tuning to accomodatenew architectures),

• More accurate algorithms (including better numericalmethods, and use of extra precision)

• Expanded Functionality (including updating anddowndating, new eigenproblems, etc. and puttingmore of LAPACK into ScaLAPACK)

• Improved ease of use (friendlier interfaces in multiplelanguages)

To accomplish these goals we are also relying on better soft-ware engineering techniques and contributions from collab-orators at many institutions. This is joint work with JackDongarra.

James W. DemmelUniversity of CaliforniaDivision of Computer [email protected]

MS47

Interfaces to Low-Level Kernels

A number of projects, including the ATLAS, PHiPAC,and GotoBLAS efforts, have shown that high-performanceBLAS libraries can be built upon a small number of highlyoptimized ”inner-kernels”. These insight raise the questionof whether the software layers, traditionally maintained bydense linear algebra libraries, need to be redefined. Forexample, on SMP systems the duplication of the packingof data for locality by different threads could be avoided ifroutines at the LAPACK library level had direct access tolow level kernels rather than through the BLAS libraries.In this talk we review how the best BLAS library imple-mentations are achieved and what opportunities there exist

by exposing the natural building blocks of these implemen-tations.

Kazushige GotoTexas Advanced Computing CenterThe University of Texas at [email protected]

MS47

A Critical Look at ScaLAPACK and PLAPACK

This talk will discuss the development of parallel linearalgebra libraries with a critical look at what works bestin ScaLAPACK and PLAPACK. We show how to developoptimal blocking scheming, including the physical blockingused in ScaLAPACK, with the inner algorithmic blockingused in benchmarks such as HPL, with the outer algorith-mic blocking used in PLAPACK. We include both a theo-retical and a practical experience with different mappingsother than the standard 2D-wrap Cartesian mappings suchas was used in a recent Top 500 entry at NASA Ames. Wemake a case for when the use of dynamically re-blockingthe data is useful in practice, and how some of the con-ventions used in ScaLAPACK actually stand in the way ofthe fastest performance. We discuss the practical use ofrecursion and alternate data structures. We make a strongcase against the mindless iteration that some projects doin a misplaced effort to mislead their users into believingthat the computer is actually doing useful work for them,and discuss how the future of HPC must follow anotherpath if we are to be successful. We make a strong case forthe best use of a benchmark such as LINPACK. As a proofof concept, we provide numbers on a final library whichachieve superior performance to any other known libraryor benchmark.

Greg HenryIntel [email protected]

MS48

The Optimal Placement of Sensors for Recoveringthe Source of a Chemical/Biological Attack

The optimal placement of sensors in a building to enablethe rapid location of the source of a chemical/biological at-tack is an important and difficult problem. Our approachto the modeling of the problem postulates a set of possibleattack scenarios against which the sensor placement mustbe effective. We give the details of our sensor placementmodel and present some numerical results. Finally, we dis-cuss the strengths and weaknesses of our approach.

Stephen Margolis, Kevin LongSandia National [email protected], [email protected]

Paul BoggsSandia National [email protected]

MS48

Identification and Prediction of Airborne Contam-inant Dispersion Via Terascale Inversion

We are interested in the localization of airborne contam-inant releases in regional atmospheric transport modelsfrom sparse observations, in time scales short enough for

PP06 Abstracts 83

predictions to be useful for hazard assessment, mitigation,and evacuation procedures. In particular, our goal is rapidreconstruction — via solution of an inverse problem —of the unknown initial concentration of the airborne con-taminant in a scalar convection-diffusion transport model,from limited-time spatially-discrete measurements of thecontaminant concentration, and from a velocity field aspredicted, for example, by a mesoscopic weather model.We employ special-purpose parallel multigrid algorithmsthat exploit the spectral structure of the inverse operator.Experiments on problems of localizing airborne contami-nant release from sparse observations on large parallel sys-tems suggest that ultra-high resolution data-driven inver-sion may be carried out sufficiently rapidly for simulation-based “real-time” hazard assessment.

Judith HillSandia National [email protected]

Volkan AkcelikCarnegie Mellon UniversityLaboratory for Mechanics [email protected]

George BirosUniversity of [email protected]

Omar GhattasUniversity of Texas at [email protected]

Andrei I. DraganescuSandia National LaboratoriesOptimization & Uncertainty [email protected]

Bart G. Van Bloemen WaandersSandia National [email protected]

MS48

Polynomial Chaos Acceleration of Bayesian Meth-ods for Solving Inverse Problems

The Bayesian setting for inverse problems provides a foun-dation for inference from noisy data and uncertain modelsand a quantitative description of uncertainty in the inversesolution. For complex PDE-based forward models, how-ever, the cost of likelihood evaluations may render Bayesianmethods prohibitive. We address this difficulty by devel-oping a polynomial chaos construction to propagate prioruncertainty through the forward model and efficiently eval-uate Bayesian integrals, demonstrating the technique on aninverse scalar transport problem.

Youssef M. MarzoukSandia National [email protected]

Habib N. NajmSandia National LaboratoriesLivermore, CA, [email protected]

MS48

Parallel Solution of Optimal Control Problems Us-

ing an Inexact SQP Algorithm

We discuss the design, implementation and application ofa parallel sequential quadratic programming (SQP) algo-rithm for the solution of PDE constrained optimizationproblems. Our algorithm is a matrix-free extension ofwell-known trust-region SQP methods. Each iteration re-quires the solution of several KKT type systems. Our SQPmethod dynamically adjusts the stopping tolerances for alliterative linear system solves based on the progress of theouter iteration. The stopping tolerances can be easily com-puted. Optimization level domain decomposition precon-ditioners are used to accelerate the system solves. Thealgorithm is implemented in C++ and separates the han-dling of optimization and linear algebra tasks. We presenttests using the Trilinos Solver Framework for the linear al-gebra subtasks. Numerical results include optimal controlproblems governed by steady-state nonlinear elliptic andincompressible Navier-Stokes equations.

Matthias HeinkenschlossRice UniversityDept of Comp and Applied [email protected]

Denis RidzalRice UniversityDept. of Computational and Applied [email protected]

MS49

Infini-T: The Infinite Thread Architecture

To ensure program correctness in multiprogramming, syn-chronization mechanisms are needed to enforce data depen-dencies and timing constraints between the various threadssuch as locks and transactions. These synchronizationmechanisms are expensive, so programmers must deploythem frugally, making them difficult to reason about andoften leading to unwanted race conditions or deadlock. Weintroduce the stored-processor architecture, which provideshardware support for the waiting mechanism. By facilitat-ing direct communication between the threads, the waitingthread can efficiently suspend until it is woken up by thereleasing thread.

Krste AsanovicMIT EECS [email protected]

MS49

Transactional Memory: Architectural Support forPractical Parallel Programming

As chip-multiprocessors replace uniprocessors in all sys-tem types, practical parallel programming is becoming anurgent need. Transactional memory is a new model forshared-memory multiprocessors that relies on user-definedtransactions as the basic unit of parallelism, coherence andconsistency, error atomicity, and optimization. Transac-tional memory simplifies parallel programming by eliminat-ing the need for manual orchestration of parallelism usinglocks. It also provides a flexible environment for dynamicoptimizations and fault-tolerant computing. This talk willintroduce the software and hardware aspects of transac-tional memory techniques developed at Stanford Univer-sity, provide an initial evaluation, and discuss its long-term

84 PP06 Abstracts

potential.

Christos KozyrakisDepartments of Electrical Engineering and ComputerScienceStanford [email protected]

MS49

How to Hurt Scientific Productivity

The boundary between software and hardware complex-ity has been increasingly shifted in the direction of thesoftware over the past few generations of microprocessorarchitectures. The result is that the path towards opti-mization is increasingly obfuscated. This talk challengesconventional wisdom in computer architecture and investi-gates how hardware architects can improve system designin ways that simplify the burden on software engineeringand improve scientific productivity.

David PattersonUC Berkeley CSEE [email protected]

MS49

Critical Factors and Key Directions For PetaflopsScale Supercomputers

The rate of improvement in the effective performance ofmodern processor architectures is departing dramaticallyfrom historical trends. The decline of brute-force clockrate improvements opens a significant opportunity to con-sider more elegant approaches to system architecture. Thispresentation will describe the causal factors that cause ma-chines that are fast on paper to provide sub-par perfor-mance on real scientific applications. It goes on to describea number of architectural alternatives for petaflops-scalescientific computing platforms.

Thomas Sterling

NASA JPL / California Institute for TechnolgyLSU Center For Computation and Technology (CCT)[email protected]

MS50

Eigenvalue Solvers for Electromagnetic Fields inCavities

We investigate the Jacobi-Davidson algorithm for comput-ing a few of the smallest eigenvalues of a generalized eigen-value problem resulting from the finite element discretiza-tion of the time-harmonic Maxwell equation. Multilevelpreconditioners are employed to improve convergence andmemory consumption of the eigensolver. We present re-sults of very large eigenvalue problems originating fromthe design of resonant cavities of particle accelerators. Wedetail our approach for parallelizing the code by means ofthe Trilinos software framework.

Roman GeusPaul Scherrer [email protected]

Marzio SalaETHZ Computational [email protected]

Peter ArbenzSwiss Federal Institute of Technology (ETH)Institute of Computational [email protected]

MS50

The Impact of the Two-Level FETI-DPH IterativeSolver on the Performance of the Inverse ShiftedLanczos Method

Domain decomposition-based linear solvers have emergedas powerful algorithms for symmetric positive definite sys-tems. However, such methods are inapplicable to symmet-ric indefinite problems, (A− sM)x = b, arising within theshift-invert Lanczos method for the generalized symmetriceigenvalue problem. The FETI-DPH solver was designedto address such systems when s is a large scalar shift. Sothis talk focuses on the performance of FETI-DPH and itsimpact on a shift-invert Lanczos eigensolver.

Charbel Farhat, Philip AveryStanford [email protected], [email protected]

MS50

A Rayleigh Quotient Minimization AlgorithmBased on Algebraic Multigrid

Several efficient algorithms are available to solve large-scaleeigenproblems. In this talk, we will review briefly some ap-proaches and then focus on a multigrid-based algorithm.The RQMG algorithm, introduced by Mandel and Mc-Cormick, approximately minimizes the Rayleigh quotientover a sequence of geometric grids. Here we will present analgebraic extension, where we replace the geometric meshinformation with the algebraic information defined by analgebraic multigrid preconditioner.

Rich Lehoucq, Ulrich L. HetmaniukSandia National [email protected], [email protected]

MS50

Performance of Parallel Projection Algorithms forSolving Nonlinear Eigenvalue Problems in Elec-tronic Structure Calculation

One of the fundamental problems in electronic structurestudy is to determine wave functions associated with theminimum total energy of large atomistic systems. Thesefunctions are solutions to a nonlinear eigenvalue problemwhich can be solved by a number of projection algorithms.The parallel performance of these algorithms are examinedand compared in this talk.

Lin-Wang WangLawrence Berkeley National [email protected]

Juan C. MezaLawrence Berkeley National [email protected]

Chao YangLawrence Berkeley National [email protected]

PP06 Abstracts 85

MS51

Runtime Support System for Partially CoupledGrid Computations

Adaptive mesh generation and refinement impose uncom-mon requirements on runtime support substrate. Conven-tional communication libraries like MPI are not well-suitedfor development of mesh generation codes. In this talk wewill present the communication library, Clam, which aimsthe applications of interest, and describe our experienceusing this code on CoWs and on TeraGrid.

Andriy FedorovThe College of William and MaryComputer Science [email protected]

Nikos ChrisochoidesComputer ScienceCollege of William and [email protected]

MS51

Building Symbiotic Relationships Between FormalVerification and High Performance Computing

Computational simulation for scientific and engineering ap-plications are becoming more ubiquitous as part of theengineering design cycle. The application of simulationscience to complex problems often require complex mod-els, sophisticated numerics and intricate implementations.Tremendous effort has been expended toward the devel-opment of systematic techniques for model validation andnumerical method verification. As most researcher’s hesi-tantly admit, the amount of time spent debugging intricatehigh performance parallel implementations of their simu-lations consume a large bulk of their time. In particular,many would argue that although this debugging time isnecessarily, it distracts one from the science or engineeringproblem of interest. In this talk, we will present a new ef-fort by the Utah Gauss Group to employ formal verificationtechniques to the debugging of parallel high performancecomputing codes using MPI. This synergistic combinationof formal techniques with HPC is designed to infuse newsways of thinking about parallel code design through in-teraction of two normally disparate communities, with thegoal of benefiting both communities.

Robert KirbyUniversity of [email protected]

MS51

Developing and Supporting a Large Scale ParallelApplication for Scientific Users - The TITAN Geo-physical Mass Flow Code Experience

Over the last few years we have developed the TITANgeophysical mass flow modeling toolset. This tool is inwidespread use for volcanic hazard mapping. The tool of-fers, parallel adaptivity as a primary mechanism for ren-dering solvable some of the computaitionally intractableproblems in modeling hazardous mass flows over naturalterrain. Optional modules also enable grid based use, dy-namic data and stchastic modeling of input parameters.We will present the significant success of the developmentbased on a common parallel infrastructure and aspects thatenabled many geologists with minimal expertise in numer-

ical methodology to use this tool successfully.

Keith DalbeyDept of Mechanical EngineeringSUNY at [email protected]

Abani K. PatraSUNY at BuffaloDept of Mechanical [email protected]

E Bruce Pitman, M. JonesSUNY [email protected], [email protected]

MS51

From Physical Modeling to Scientific Understand-ing: An End-to-End Approach to Parallel Super-computing

Conventional parallel scientific computing uses files as in-terface between simulation components such as meshing,partitioning, solving and visualizing. This approach resultsin time-consuming file transfers, disk I/O and data formatconversions that consume large amounts of network, stor-age, and computing resources while contributing nothingto applications. We propose an end-to-end approach thatreplaces the cumbersome file interface with a scalable, par-allel, runtime data structure, on top of which all simulationcomponents are constructed. This framework has been im-plemented for octree-based finite element simulations; per-formance evaluation for earthquake simulations on up to2048 Alpha processors shows good good isogranular andfixed-size scalability.

Jacobo BielakCarnegie Mellon UniversityDepartment of Civil & Env. [email protected]


Kwan-Liu MaDepartment of Computer ScienceUniversity of California at [email protected]

Hongfeng YuUniversity of California [email protected]

Leonardo Ramirez-GuzmanCarnegie Mellon [email protected]

David O’Hallaron, Tiankai TuCarnegie Mellon [email protected], [email protected]

MS52

PyACTS: A High-Level User Interface to TheACTS Collection

The ACTS Collection brings a robust and high perfor-mance set of software libraries to the hands of compu-

86 PP06 Abstracts

tational scientists. However, application developers oftendont use tools because of the intricacy in understandingarguments and parameters that mostly relate to runningefficiently in parallel computing environments. PyACTSprovides a didactical user interfaces for helping users withtheir first application prototype and subsequent code de-velopments. Here we look at existing functionality in Py-ACTS, applications and future plans.

Leroy A. DrummondComputationa Research DivisionLawrence Berkeley National [email protected]

Vicente Galiano, Violeta Migallon, Jose PenadesUniversidad de [email protected], [email protected],[email protected]

MS52

Numerical Policy Interface for Automatic TuninigLibrary

Recently, parameters for performance tuning are diverse inmany matrix libraries, thus automatic tuning facilities havebeen required. In this talk, a new framework, numericalpolicy, is proposed to achieve higher semantics library-userrequirement. The framework is applied to Lanczos eigen-solver on a SMP platform, one node of the SR11000. Theresult shows that the framework effectively selects a goodparameter value that balances between amount of memoryand computation time.

Ken NaonoCentral Research Laboratory, Hitachi, [email protected]

Hiroyuki Kidachi, Mitsuyoshi IgaiHitachi ULSI Systems Cp., [email protected], [email protected]

MS52

Tools for Performance Discovery and Optimization

The TAU Performance System offers integrated perfor-mance instrumentation, measurement and analysis capa-bilities. In this minisymposium, we describe how TAUis applied to evaluate the performance of numerical ker-nels. Performance profiles generated by TAU are storedin a performance database. The performance database isqueried for historical performance data and application-specific metadata. This enables selection of appropriatenumerical algorithms to match the given problem. TAU’sPerfExplorer framework is used for performance data min-ing and performance knowledge discovery and optimiza-tion.

Sameer ShendeDept. of Computer and Information ScienceUniversity of [email protected]

MS52

OSKI: A Library of Automatically Tuned SparseMatrix Kernels

The Optimized Sparse Kernel Interface (OSKI) is a collec-tion of low-level C primitives that provide automatically

tuned computational kernels on sparse matrices, for use bysolver libraries and applications. While conventional im-plementations of sparse matrix-vector multiply run at 10%of peak or less, careful tuning can achieve 31% of peak and4x speedups. However, tuning depends on the matrix andmachine, and is deferred until run-time. This talk reviewsthe design and implementation of OSKI.

Richard VuducLawrence Livermore National [email protected]

James W. DemmelUniversity of CaliforniaDivision of Computer [email protected]

Katherine YelickUC [email protected]

MS53

A Software Framework for Parallel Agent-BasedApplications

Agent-based computing can significantly improve the abil-ity to model, design, and build complex systems in a vari-ety of application areas. The large computational require-ments of these simulations make the development of scal-able software important. The presented software frame-work includes an agent movement module, which is crucialto spatially explicit agent models on continuous space, andan agent behavior module, which embeds parallel elementsearching algorithms to support the agents interaction withother agents and well as the surrounding environment.

Robert M. HunterMajor Shared Resource Center (MSRC)U.S. Army Engineer Research and Development [email protected]

Preston McAllisterDepartment of Computer Science and EngineeringMississippi State [email protected]

Cary D. ButlerInformation Technology LaboratoryU.S. Army Engineer Research and Development [email protected]

Jing-Ru Cheng

Major Shared Resource Center (MSRC)U.S. Army Engineer Research and Development Center(ERDC)[email protected]

MS53

Application of a Parallelized Eulerian-Lagrangian-Agent Method (ELAM) for Ecohydraulics and Wa-ter Resource Management

We describe performance of a parallelized EulerianLa-grangianagent method (ELAM), which integrates (1) aEulerian mesh framework describing the physical, hydro-dynamic, and water quality domains, (2) the Lagrangianframework describing sensory perception and movementtrajectories of individuals, and (3) the agent framework

PP06 Abstracts 87

describing the behavior decisions of individuals. ELAMsare designed to mechanistically decode/forecast movementpatterns of individual fish and must efficiently handle largenumbers of virtual individuals, unstructured mesh ele-ments, and time steps.

John M. Nestler, Robert M. Hunter, R. Andrew GoodwinU.S. Army Engineer Research & Development [email protected],[email protected],[email protected]

Jing-Ru ChengMajor Shared Resource Center (MSRC)U.S. Army Engineer Research and Development Center(ERDC)[email protected]

MS53

Simulating the Emergence of Hierarchical Organi-zation in Ecological Systems


Anthony KingProgram for Ecosystem ResearchOak Ridge National [email protected]

MS53

Algorithms and Software for Agent-Based Modelsof Biological Systems

We present algorithms and software for agent-based simu-lations of biological systems. Our goal is the developmentof a computational framework for the modeling of the im-mune response at the cellular level. The framework allowsfor cellular interactions, with or without internal state, tobe modeled with individual agents. Simple protiens can bemodeled as volume concentrations. Object motion such asdiffusion and chemotaxis are efficiently modeled by statisti-cal means. Results from the use of this software frameworkto simulate the persisten infection of the Epstein-Barr virusand acute influenza infections of the lung are presented.

Mark JonesDept. of Elect. and Comp. Eng.Virginia [email protected]

Paul E. PlassmannVirginia [email protected]

MS54

Parallel Adjoint-Based Methods for Electromag-netic Shape Optimization

We formulate the problem of designing the low-loss cavityfor the International Linear Collider (ILC) as an electro-magnetic shape optimization problem involving a Maxwelleigenvalue problem. The objective is to maximize thestored energy of a trapped mode in the end cell while main-taining a specified frequency corresponding to the acceler-ating mode. A continuous adjoint method is presented forcomputation of the design gradient of the objective andconstraint. The gradients are used within a nonlinear op-timization scheme to compute the optimal shape for a sim-

plified model of the ILC. We discuss challenges involved inparallelizing the method.

Lie-Quan LeeAdvanced Computations DepartmentStanford Linear Accelerator [email protected]


Volkan AkcelikCarnegie Mellon UniversityLaboratory for Mechanics [email protected]

Kwok KoStanford Linear Accelerator [email protected]


MS54

PDE-Constrained Optimization Algorithms forProtein-Protein Interactions

We discuss a protein-protein interaction problem in whichelectrostatic interactions between the solvent and proteinsurfaces are modeled using a standard classical Poisson-Boltzmann (PB) formulation. The optimization param-eters are the charge positions. The problem is highly-nonlinear and combinatorial in nature. We give detailsfor a simplified version in which we fix the geometry of thetwo proteins and we allow a continuous variation of the po-sition of the charge. We first present a parallel (matrix-freemultigrid accelerated) solver for the forward problem andthen we present the formulation and scalability results foradjoint sensitivity analysis.


MS54

Multi-Level Techniques for Inverse Problems In-volving Incompressible Flows

We show that multi-level techniques designed for solvingregularized ill-posed problems can be applied successfullyto the inverse problem of reconstructing the initial state ofa time-dependent incompressible flow from sparse measure-ments in space and time. We show that, at fine resolution,this inverse problem can be solved at a cost that is a rea-sonably small multiple of the cost for a forward solve.

Andrei I. DraganescuSandia National LaboratoriesOptimization & Uncertainty [email protected]

Bart G. Van Bloemen WaandersSandia National [email protected]

88 PP06 Abstracts

MS54

The Lagged Time Method for Parallelizing Opti-mization with Time Dependent Constraints

In this talk we describe a method to solve optimizationproblems with time dependent constraints. We propose anew method we call, lagged time, where we separate thecomputation into two steps that can be done in parallel.This allows us to trivially parallelize our algorithm. Weshow performance testing which indicate that the methodscales well when the number of processors increase.

Eldad HaberEmory UniversityDept of Math and [email protected]

MS55

The Bootable Cluster CD: From Student Lab toStudent Cluster

The Bootable Cluster CD (BCCD) provides an instantclustering environment when booted into ram from theCDROM drives of networked workstations. This makes theBCCD ideally suited for existing Windows and Mac labsfound at virtually every academic institution. A completeand fully-configured clustering environment is establishedwithin minutes, requires no permanent installation, andleaves no trace on the host system after rebooting, makingit an ideal tool for High Performance Computing Educa-tion.

Paul GrayUniversity of Northern IowaComputer [email protected]

MS55

Teaching Practices for Introductory Parallel Com-puting

When introducing parallel computing concepts to studentsfor the first time, using examples that are visual, that showspeed-up, that include real-time analysis of results, andthat run in a short enough period of time to fit in a stan-dard lecture can help to emphasize key concepts. A se-ries of publicly available examples hosted on the Compu-tational Science Education Reference Desk of the NationalScience Digital Library will be presented.

David A. JoinerKean UniversityAssistant [email protected]

MS55

Little-Fe: An Inexpensive, Portable Teaching Clus-ter for All Levels of Computational Science Educa-tion

As scientific research moves increasingly towards interdisci-plinary collaboration, computational science education lan-guishes on the periphery. A key cause is the lack of comput-ing resources available to the average classroom. We havedeveloped a portable teaching cluster, Little-Fe, which isboth easy and inexpensive to replicate. Little-Fe’s operat-ing environment is the Bootable Cluster CD, which comescomplete with HPC computational tools, applications and

curricular material for computational science education.

Charles PeckCluster Computing Group, Computer ScienceEarlham [email protected]

Thomas P. MurphyContra Costa CollegeComputer [email protected]

MS55

Teaching Parallel Computing to Non-ComputerScientists

Parallel Computing can appear to be difficult to learn, andtypically is presented in technical terms with substantialdetail. Yet the basic concepts can be straightforwardlydescribed in a non-technical manner, using analogies, sto-rytelling and live demonstrations, so as to be accessibleto computing novices. This talk will address techniquesfor teaching parallelism to diverse audiences, ranging frompracticing scientists and engineers to non-programmers.

Henry NeemanUniversity of OklahomaOU Supercomputing Center for Education & [email protected]

MS56

A Distributed Packed Storage for Large ParallelCalculations

Our initial motivation was to develop for the GOCE1 mis-sion efficient parallel solvers that can tackle the very largedense linear least squares problems encountered in gravityfield calculations. Since unlike LAPACK, there is no sup-port for packed matrices in the standard parallel libraryScaLAPACK, we implemented a packed distributed stor-age based on ScaLAPACK computational kernels and thatexploits the symmetry or the triangular structure of a ma-trix. We present performance results for the Cholesky fac-torization and for the updating of the R factor in the QRfactorization.

Serge [email protected]

Marc BaboulinCERFACS, [email protected]

Luc GiraudCERFACS, Toulouse, [email protected]

MS56

BLAS 2.5, Householder Bidiagonalization, andComputer Architecture

Due to recent advances in computing matrix singular vec-tors, the initial reduction to bidiagonal form is typicallythe most computationally intensive part of a singular value

1Gravity field and steady-state Ocean Circulation Explorer,European Space Agency, satellite scheduled for launch in 2006

PP06 Abstracts 89

decomposition. This talk mainly explains how the orderof computations can be rearranged to speed Householderbidiagonalization on cache-based computer architectures.A BLAS 2.5 operator combines several BLAS 2 matrixvector operations, increasing data locality. For example,a GEMVT combines two matrix vector multiplications( GEMVs) so that both can be performed in only one fetchof the matrix from RAM. For an n×n matrix, a BLAS 2.5 -BLAS 3 Householder bidiagonalization uses GEMVT andGEMM (matrix matrix) operations for all but O(n2) ofa total of O(n3) flops. Compared to performing the sameoperations by a BLAS 2- BLAS 3 ( GEMV matrix vec-tor, GEMM) the number of fetches of data from RAM isapproximately halved. The serial BLAS 2.5-BLAS 3 algo-rithm significantly speeds bidiagonalization on most mod-ern architectures. Whether the BLAS 2.5-3 algorithm isuseful in the distributed memory parallel case depends oncache size and communication latency. If the time to starta communication is much less than the time to refill cachememory, then parallel GEMVT is faster than using par-allel GEMV operations. Then a BLAS 2.5-3.0 diagonal-ization is desirable. Automated adaptation of algorithm toarchitecture and extensibility to the sparse case will alsobe discussed.

Gary HowellNorth Carolina State [email protected]

J. W. DemmelUniversity of California, [email protected]

M. Premkumar, C.T. FultonFlorida Institute of [email protected], [email protected]

Sven J. HammarlingNumerical Algorithms Group LtdWilkinson [email protected]

MS56

Parallel Algorithms and Software for Solving Stan-dard and Generalized Sylvester-Type Equations

We develop parallel ScaLAPACK-style algorithms basedon the Bartels-Stewart’s method for solving continuous-time and discrete-time standard and generalized Sylvester-type matrix equations. The reduced triangular problemsare solved using explicit matrix blocking and wavefront-like traversal of the right-hand-side matrices. Node solversare either recursive or explicit blocked variants. We applyour solvers in condition estimation of the matrix equations,developing parallel Sep−1-estimators. Our deliverables in-clude a software package SCASY which contains generaland triangular solvers and condition estimators.

Robert GranatDept. of Computing Science and HPC2NUmea University, [email protected]

Bo T. KagstromUmea UniversityComputing Science and [email protected]

MS56

Updating-Downdating Factorizations in ScaLA-PACK

We present some parallel distributed software based onthe ScaLAPACK kernels for performing the updating anddowndating of various factorizations of ScaLAPACK (QR,LU and Cholesky). Example codes with performance andscalability results are given. Issues when parallelizing thesequential algorithm are described.

Julien LangouThe University of [email protected]

MS57

Programming Multi-Core Processors

Mainstream multi-core computers require mainstream par-allel applications. The software industry must fundamen-tally change to embrace concurrency and make parallelsoftware the norm. This presentation will explore the pro-gramming models and languages that will enable this trans-formation to a brave new world of ubiquitous parallel com-puting.

Ali-Reza Adl-TabatabaiProgramming Systems LabIntel [email protected]

MS57

Multi-Core in the Real World: The IBM Cell Pro-cessor as an Early Multi-Core Product


Peter [email protected]

MS57

What are Multi-Core Processors and Why AreThey Inevitable?

All of the major CPU vendors are moving to multi-coreproducts. They aren’t a passing fad or the latest mar-keting ploy. When competing CPU vendors all move inthe same direction, you know something is up. Multi-coreCPUs are the only way for CPU performance to increasewithin a fixed power envelope as transistor density con-tinues to climb. In this presentation, we will explain whymulti-core CPUs are an inevitable consequence of semicon-ductor manufacturing. And we will lay out at a high levelthe opportunities and challenges presented by the move tomulti-core CPUs.

Timothy MattsonIntel [email protected]

MS57

Programming Multi-Core processors and SMPswith OpenFLAME

The Formal Linear Algebra Methods Environment(FLAME) project at UT-Austin pursues the systematicderivation of proven correct algorithms for dense linear

90 PP06 Abstracts

algebra operations. The FLAME/C API allows the al-gorithms to be conveniently coded in the C programminglanguage. It has been used to reimplement large portionsof the popular BLAS and LAPACK libraries on sequen-tial processors. For SMP architectures, the API has beenextended to use OpenMP, yielding the OpenFLAME API.Through a simple mechanism known as work queuing high-performance parallel implementations can be almost triv-ially achieved. The approach should also support multi-core systems with a medium number of cores (32-64). Inthis talk we present the approach and show scalabilitylessons that can be learned. This work is in collabora-tion with a number of FLAME project participants. Fordetails, visit

http://www.cs.utexas.edu/users/flame/

Field Van ZeeDepartment of Computer SciencesThe University of Texas at [email protected]

MS58

Challenges Facing Future Mesh-based Codes atLarge Scale: An Application-centric View

Parallel scientific codes today often make use of structured,static spatial grids partitioned among processors in a regu-lar fashion. This practice not only simplifies the process ofmesh generation and application development, but also al-lows the logical topology of the application to be optimallymapped onto the underlying physical topology of the par-allel machine. As illustration, we present two large-scaleapplications typical of what can be found in national labo-ratories: HYCOM, a general ocean circulation model, andKRAK, a hydrodynamics code used to simulate explosivedetonations. Using performance modeling techniques de-veloped at Los Alamos National Laboratory, we examinehow such codes utilize today?s computing resources. Withthis foundation, we then peer into the near future and sur-mise how the use of such applications will evolve; as codesmake use of larger and more refined spatial grids, dynamicmesh adaptivity may play a critical role as parallel appli-cations strive to make efficient use of parallel resources. Atthe same time, we examine recent trends in supercomputerarchitecture and surmise that machines of the future maybecome more structured in their network topology, as illus-trated by IBM?s BlueGene/L machine and the Cray XT3,both of which utilize 3D torus interconnect topologies. Wedescribe one research activity, an intelligent system soft-ware toolkit known as the Parallel Runtime Environmentfor Multi-computer Applications, or PREMA, that goessome way toward bridging the gap between increasinglyregular parallel machine topologies and the demands of ir-regular and unstructured scientific codes.

Kevin BarkerLos Alamos National [email protected]

MS58

Parallel Out-of-Core Delaunay Refinement Methodusing COTS

We present a solution for generating very large meshes ina small cluster of workstations environment using commer-cial off-the-shelf hardware and the best publicly availableoff-the-shelf sequential in-core Delaunay mesh generator.It is cost-effective since the total wall-clock time includ-

ing wait-in-queue delays for the out-of-core methods on asmall cluster is considerably shorter than that for the in-core generation of the same size mesh (using over a hundredprocessors). It is also high-performance since the speed ofmesh generation of our out-of-core method is comparableto the speed of in-core implementation of the algorithm.

Andriy KotComputer Science DepartmentThe College of William and [email protected]

MS58

Distributed Adaptive and Interactive SimulationsUsing Structured Adaptive Mesh-Refinement

Structured adaptive simulations can yield highly advanta-geous cost/accuracy ratios and can enable accurate solu-tions to realistic models of complex physical phenomenon.However, the phenomena being modeled by these simu-lations and their implementations are inherently dynamicand heterogeneous, and their large scale implementationpresents many challenges. In this presentation I will outlinethe challenges of distributed SAMR and will present a com-putational engine that enables efficient and scalable imple-mentations of these applications. Specifically, I will addressdynamic data-management, adaptive load-balancing, andinteractive monitoring and control.

Manish ParasharDepartment of Electrical & Computer EngineeringRutgers [email protected]

MS58

Resource-Aware Parallel Scientific Computationfor Heterogeneous and Non-Dedicated Clusters

Many modern parallel computing environments includeheterogeneous resources or allow shared access to resources.We present a system called the Dynamic Resource Utiliza-tion Model to help guide dynamic load balancing in thepresence of heterogeneity or non-dedicated usage of pro-cessing and communication resources. DRUM combinesinformation from a static description of a computing envi-ronment with information from a dynamic monitoring sys-tem to produce an appropriate distribution of work amongthe cooperating processes.

James D. TerescoDepartment of Computer ScienceWilliams [email protected]

PP0

A Parallel Multishift QZ Algorithm for GeneralizedEigenvalue Problems

The QZ algorithm reduces a regular matrix pair (H,T )in Hessenberg-triangular form to a generalized Schur form(S, T ). A novel parallel implementation of a small-bulgemultishift QZ algorithm for solving the generalized eigen-value problem is presented. The algorithm chases chainsof small bulges instead of only one bulge in each QZ iter-ation, which makes it more amenable in a parallel setting.In addition, advanced deflation strategies are used, includ-ing aggressive early deflation techniques. Recent progress

PP06 Abstracts 91

will be reported including parallel performance results.


Bjorn Adlerborn, Daniel KressnerUmea [email protected], [email protected]

PP0

Comparing Parallel CAMR and SAMR Hydrody-namics Simulations

We compare the relative accuracy and performance of twoparallel 3D Eulerian adaptive mesh refinement (AMR)codes, one cell-wise and one patch-based, on several hy-drodynamics test problems. Using software developedfor analyzing the comprehensive performance behavior oflarge-scale parallel applications, we investigate relation-ships between solution time, processor counts, communica-tion amount, message sizes, active- and ghost-cell counts,time steps per level, load balancing efficiency, flop rates,and memory access rates.

Michael Norman, James Bordner, Alexei KritsukUniversity of California, San [email protected], [email protected],[email protected]

Robert WeaverLos Alamos National [email protected]

PP0

Scalable Implicit Solutions for MHD Models

We describe scalable implicit solutions for time depen-dent, two dimensional, stream function based, magneto-hydrodynamics (MHD) models. We employ a fully im-plicit Newton-Krylov-Schwarz approach, the linear systemsbeing preconditioned with one level additive Schwarz andincomplete LU on subdomains, and solved with GMRESiterations. The space discretization, based on finite differ-ences, is second order and the time discretization, based onbackward differentiation formulas is second, third or fourthorder.


Florin DobrianColumbia [email protected]

Serguei OvtchinnikovUniversity of Colorado at BoulderDepartment of Computer [email protected]

David E. KeyesColumbia UniversityDepartment of Applied Physics & Applied [email protected]

PP0

RECSY and SCASY: Library Software for MatrixEquations

High-performance software forsolving common continuous-time and discrete-time stan-dard and generalized Sylvester-type matrix equations arepresented. RECSY uses recursive blocked algorithms forsolving one-sided and two-sided matrix equations on se-rial and SMP nodes. The recursive approach leads to anautomatic variable blocking amenable for deep memory hi-erarchies. SCASY uses explicit matrix blocking in paral-lel ScaLAPACK-style algorithms for solving similar equa-tions on distributed memory platforms. Both RECSY andSCASY provide condition estimation functionality.

Isak JonssonUmea UniversityDept of Computing Science and [email protected]


Robert GranatDept. of Computing Science and HPC2NUmea University, [email protected]

PP0

A Case Study in Tools and Techniques for Opti-mization on the Power5

This presentation, in poster format, will detail the vari-ous tools and techniques that were used to optimize a FastPoisson Solver code using one of a class of spectral meth-ods. We cover both the tools that we used for profilingthe program, including dynamic instrumentation, as wellas various techniques available to optimize numerically in-tensive kernels in the POWER5 processor environment.

John MartineI.B.M. [email protected]

PP0

Performance Intercomparison of Community At-mosphere Model on High-End Computing Plat-forms

We analyze the performance of the Community Atmo-sphere Model when run at high resolution with the finite-volume dynamical core on leading high-end computing sys-tems. This includes the vector-based Cray X1E and EarthSimulator, IBM systems including the Power-5, and theIntel Itanium 2 cluster. We consider various paralleliza-tion and communication paradigms. The throughput ona per-processor basis is highest on the Cray-X1E, whichoutperforms the IBM Power-3 by close to an order of mag-nitude.

Patrick H. WorleyOak Ridge National [email protected]

Arthur A. MirinLawrence Livermore Nat’l Lab.

92 PP06 Abstracts

[email protected]

Leonid Oliker, Michael WehnerLawrence Berkeley National [email protected], [email protected]

William SawyerSwiss Federal Institute, [email protected]

David ParksNEC Solutions [email protected]

PP0

Modeling Hazardous Geophysical Mass Flows

We describe in this poster the development of a toolkitfor modeling a range of hazardous geophysical mass flows.These flows range from hundreds of meters to tens of kilo-meters in runout length and flow durations often in thehours. Large scale parallel adaptive discontinuous Galerkinschemes are essential to enabling realistic simulations in-corporating the complex representations of physics neededand stochastic computations needed for hazard analysis.

Abani K. PatraSUNY at BuffaloDept of Mechanical [email protected]

PP0

Measuring and Reducing Network Latency in the2.6 Linux Kernel

Network latency continues to be a bottleneck for manycomputational science applications on Beowulf clusters.Using open source tools, we develop technology for makingnanosecond precision measurements between timing pointswithin the 2.6 Linux kernel. Using this technology, in con-junction with low-level and application benchmarks, we es-tablish that there is significant network latency in the ker-nel and identify its origin. Lastly, we survey software-basedlatency reduction technologies and make recommendationsbased on our measurements.

Skylar Thompson, Josh McCoy, Alex Lemann, TobiasMcNulty, Kevin HunterCluster Computing GroupEarlham [email protected], [email protected],[email protected], [email protected],[email protected]


PP0

Little-Fe - The Low-Cost Portable Cluster for Com-putational Science Education

Many institutions still do not have access to high per-formance computing platforms for computational scienceeducation. Little-Fe is a complete 8 node Beowulfstyle portable computational cluster. By leveraging theBootable Cluster CD project, and its associated curricu-

lum modules, Little-Fe makes it possible to have a pow-erful turn-key educational platform for under $2,500. Wedescribe Little-Fe’s hardware and software configuration,including plans for a ”do it yourself” version.

Skylar Thompson, Josh McCoy, Alex Lemann, TobiasMcNulty, Kevin HunterCluster Computing GroupEarlham [email protected], [email protected],[email protected], [email protected],[email protected]


PP0

Folding@Clusters: Harnessing Grid-Based Paral-lel Computing Resources for Molecular DynamicsSimulations

Folding@Clusters is an adaptive framework for harnessinggrid-based cluster resources for protein folding research.It combines capability discovery, load balancing, processmonitoring, and checkpoint/re-start services to provide aplatform for molecular dynamics simulations on a range ofgrid-based parallel computing resources. We examine thedesign, implementation, and field-trial results for the firstmajor release of the software, and discuss ways to improveit based on our initial production experiences.

Skylar Thompson, Josh McCoy, Alex LemannCluster Computing GroupEarlham [email protected], [email protected],[email protected]


Joshua HurseyComputer Science DepartmentEarlham [email protected]

Vijay PandeStanford UniversityPande [email protected]

PP0

The Parallel Computation of Radiative Heat Trans-fer Effects with Non-Grey Absorption CoefficientModels

We consider the problem of computing radiative heat trans-fer effects in simulations where complex, non-grey absorp-tion models are required. These calculations are imple-mented within a parallel, photon Monte Carlo softwareframework. The implementation’s parallel performancedepends on the determination of a load balancing strat-egy that takes into account temperature and absorptioncoefficient differences within the computational domain.We present computational results illustrating our proposed

PP06 Abstracts 93

load balancing strategy for a variety of representative com-bustion problems.

Paul E. PlassmannVirginia [email protected]

Ivana VeljkovicPenn State UniversityDepartment of Computer Science and [email protected]

Kamal ViswanathVirginia [email protected]

PP0

Parallel Rosenbrock Methods for Solving OptionPricing Models

In this paper we consider a meshfree radial basis functionapproach for the valuation of pricing options with non-smooth payoffs. By taking advantage of parallel architec-ture, a strongly stable and highly accurate time steppingmethod is developed with computational complexity com-parable to the implicit Euler method implemented concur-rently on each processor. This, in collusion with the radialbasis function approach, provides an efficient and reliablevaluation of exotic options, such as digital options and op-tions with multiple strike prices.

David A. VossWestern Illinois [email protected]

pp06 abstracts · 42 pp06 abstracts tion framework coolfluid is a state-of-the art object oriented...

Documents