
Working Group 1: Enabling Technologies

Chair: Sheila Vaidya Vice Chair: Stu Feldman

WG 1 – Enabling Technologies: Charter

• Charter– Establish the basic technologies that may provide the foundation for important advances in HEC capability, and determine the critical tasks required before the end of this decade to realize their potential. Such technologies include hardware devices or components and the basic software approaches and components needed to realize advanced HEC capabilities.

• Chair– Sheila Vaidya, Lawrence Livermore National Laboratory

• Vice-Chair– Stuart Feldman, IBM

WG 1 – Enabling Technologies: Guidelines and Questions

• As input to HECRTF charge (1a), please provide information about key technologies that must be advanced to strengthen the foundation for developing new generations of HEC systems. Include discussion of promising novel hardware and software technologies with potential pay-off for HEC

• Provide brief technology maturity roadmaps and investments, with discussion of costs to develop these technologies

• Discuss technology dependencies and risks (for example, does the roadmap depend on technologies yet to be developed?)

• Example topics:– semiconductors, memory (e.g. MRAM), networks (e.g. optical), packaging/cooling, novel logic devices (e.g. RSFQ), alternative computing models

Working Group Participants

• Kamal Abdali, NSF

• Fernand Bedard, NSA

• Herbert Bennett, NIST

• Ivo Bolsens, XILINX

• Jon Boyens, DOC

• Bob Brodersen, UC Berkeley

• Yolanda Comedy, IBM

• Loring Craymer, JPL

• Bronis R. de Supinski, LLNL

• Martin Deneroff, SGI

• Stuart Feldman, IBM (VICE-CHAIR)

• Sue Fratkin, CASC

• David Fuller, JNIC/Raytheon

• Gary Hughes, NSA

• Tyce McLarty, LLNL

• Kevin Martin, Georgia Tech

• Virginia Moore, NCO/ITRD

• Ahmed Sameh, Purdue

• John Spargo, Northrop Grumman

• William Thigpen, NASA

• Sheila Vaidya, LLNL (CHAIR)

• Uzi Vishkin, U Maryland

• Steven Wallach, Chiaro

Timescales

• 0-5 years– Suitable for deployment in high-end systems within next 5 years

• Implies that the technology has been tried and tested in a systems context

• Requires additional investment beyond commercial industry

• 5-10 years– Suitable for deployment in high-end systems in 10 years

• Implies that the component has been studied and feasibility shown

• Requires system embodiment and growing investment

• 10+ years– New research, not yet reduced to practice

• Usefulness in systems not yet demonstrated

Interconnects

Passive

• 0-5– Optical networking

– Serial optical interface

• 5-10– High-density optical networking

– Optical packet switching

• 10+– Scalability (node density, bandwidth)

Active

• 0-5– Electronic cross-bar switch

– Network processing on board

• 5-10– Data Vortex

– Superconducting cross-bar switch

• 10+

Power/Thermal Management, Packaging

• 0-5– Optimization for power efficiency

– 2.5-D packaging

– Liquid cooling (e.g., spray)

• 5-10– 3-D packaging and cooling (microchannel)

– Active temperature response

• 10+– Higher scalability concepts (improving OPS/W)

Single Chip Architecture

• 0-5– Power-efficient designs

– System on Chip; Processor-in-Memory

– Reconfigurable circuits

– Fine-grained irregular parallel computing

• 5-10– Adaptive architecture

– Optical clock distribution

– Asynchronous designs

• 10+

Memory

Main Memory

• 0-5– Optimized memory hierarchy

– Smart memory controllers

• 5-10– 3-D memory (e.g., MRAM)

• 10+– Nanoelectronics

– Molecular electronics

Storage & I/O

• 0-5– Object-based storage

– Remote DMA

– I/O controllers (MPI, etc.)

• 5-10– Software for “cluster” storage access to

– MRAM, holographic, MEMS, STM, E-beam

• 10+– Spectral hole burning

– Molecular electronics

Device Technologies

• 0-5– Silicon on Insulator, SiGe, mixed III-V devices

– Integrated electro-optic and high-speed electronics

• 5-10– Low-temperature CMOS

– Superconducting - RSFQ

• 10+– Nanotechnologies

– Spintronics

Algorithms, SW-HW Tools

• 0-5– Compiler innovations for new architectures

– Tools for robustness (e.g., delay, fault tolerance)

– Low-overhead coordination mechanisms

– Performance monitors

– Sparse matrix innovations

• 5-10– Very High Level Language hardware support

– Real-time performance monitoring and feedback

– PRAM (Parallel Random Access Machine model)

• 10+– Ideas too numerous to select

Generic Needs

• Sharing– NNIN-like consortia

• National Nanotechnology Infrastructure Network

– Custom hardware production

– Intellectual Property policies (open?)

• Tools for– Design for Testability

– Physical design

– Testing and Verification

– Simulation

– Programmability

High-Impact Themes

• 0-5– Show value of HEC solutions to the commercial sector

– Facilitate sharing and collaboration across HEC community

– Technology

• Power/thermal management

• Optical networking

• 5-10– Long-term consistent investment in HEC

– Technology

• 3-D Packaging

• New devices (MRAM, MEMS, RSFQ)

• Power/thermal management & Optical – Ongoing

• 10+ years– Continued research for HEC

Working Group 2: COTS-Based Architecture

Chair: Walt Brooks

Vice Chair: Steve Reinhardt

WG2 – Architecture: COTS-based Charter

• Charter– Determine the capability roadmap of anticipated COTS-based HEC system architectures through the end of the decade. Identify those critical hardware and software technology and architecture developments, required to both sustain continued growth and enhance user support.

• Chair– Walt Brooks, NASA Ames Research Center

• Vice-Chair– Steve Reinhardt, SGI

WG2 – Architecture: COTS-based Guidelines and Questions

• Identify opportunities and challenges for anticipated COTS-based HEC systems architectures through the decade and determine its capability roadmap.

• Include alternative execution models, support mechanisms, local element and system structures, and system engineering factors to accelerate rate of sustained performance gain (time to solution), performance to cost, programmability, and robustness.

• Identify those critical hardware and software technology and architecture developments, required to both sustain continued growth and enhance user support.

• Example topics:– microprocessors, memory, wire and optical networks, packaging, cooling, power distribution, reliability, maintenance, cost, size


• Steve Reinhardt (co-chair)

• Bill Kramer (L)

• Don Dossa

• Dick Hildebrandt

• Greg Lindahl

• Tom McWilliams

• Curt Janssen

• Erik DeBenedictis

• Walt Brooks (chair)

• Rob Schreiber (L)

• Yuefan Deng

• Steven Gottlieb

• Charles Lefurgy

• John Ziebarth

• Stephen Wheat

• Guang R. Gao

• Burton Smith

Assumptions/Definitions

• Definition of “COTS based”– Using systems originally intended for enterprise or individual use

– Building blocks: commodity processors, commodity memory, and commodity disks

– Somebody else builds the hardware, and you have limited influence over it

– Examples

• IN: Red Storm, Blue Planet, Altix

• OUT: X1, Origins, SX-6

• Givens– Massive disk storage (object stores)

– Fast wires (SERDES-driven)

– Heterogeneous systems (processors)

Primary Technical Findings

• Improve memory bandwidth– We have to be patient in the short term: for the next 2-3 years the die has been cast

– Sustained memory bandwidth is not increasing fast enough

– Judicious investment in the COTS vendors to effect 2008

• Improve the interconnects (“connecting to the interconnect”)– Easier to influence than memory bandwidth

– Connecting through I/O is too slow; we need to connect to the CPU at memory-equivalent speeds

• One example is HyperTransport, which represents a memory-grade interconnect in terms of bandwidth and is a well-defined interface; others are under development

• Provide ability for heterogeneous COTS-based systems– E.g., FPGA, ASIC, … in the fabric

• FPGA allows tightly coupled research on emerging execution models and architectural ideas without going to foundry

• Must have the software to support programming ease for FPGA

Technology Influence

[Figure: direct and indirect influence on COTS technology, by level (Board, Board/Components, Nodes/Frames, I/O, Interconnect, CPU/Chips). Direct lead time: 1 year, 4-6 years; direct cost to design: $0.2M, $5M, $5-100M. Indirect lead time: 10-15 years; indirect cost to design: $10M, $50M, $300-1,000M.]

Programmatic Approaches

• Develop a Government-wide coordinated method for direct influence with the vendors to make “design” changes– Less influence with COTS manufacturers, more with COTS-based vendors

– Recognize that the commercial market is the primary driver for COTS

• “Go” in early

• Develop joint Government research objectives: must go to vendors with a short, focused list of HEC priorities

– Where possible, find common interests with the industries that drive the commodity market

– “Software”: we may have more influence

• Fund long-term research– Academic research must have access to systems at scale in order to do relevant research

– Strategy for moving university research into the market

• Government must be an early adopter– Risk sharing with emerging systems

Software Issues

• Not clear that these are part of our charter, but we would like to be sure they are handled– Scaling “Linux” to 1000’s of processors

• Administered at full scale for capability computing

– Scalable file systems

– Need compiler work to keep pace

– Managing open source

• Coordinating release implementation

• Open source multi-vendor approach: O/S, languages, libraries, debuggers, …

– Overhead of MPI is going to swamp the interconnect and hamper scaling

• Need a lower-overhead approach to message passing
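The concern about message-passing overhead can be illustrated with a simple latency/bandwidth cost model. The sketch below is not from the working group: the function names and the numbers (10 microseconds of fixed per-message software overhead, a 1 GB/s link) are assumptions chosen only to show how a fixed per-message cost dominates small messages.

```python
# Illustrative latency/bandwidth cost model for a single message.
# All parameter values used below are assumptions, not measurements.


def wire_time_us(size_bytes: float, bandwidth_gb_s: float) -> float:
    """Time on the wire in microseconds (1 GB/s = 1e3 bytes per microsecond)."""
    return size_bytes / (bandwidth_gb_s * 1e3)


def link_efficiency(size_bytes: float, overhead_us: float, bandwidth_gb_s: float) -> float:
    """Fraction of the total message time spent actually moving data."""
    wire = wire_time_us(size_bytes, bandwidth_gb_s)
    return wire / (overhead_us + wire)


if __name__ == "__main__":
    # Hypothetical message sizes, overhead, and link speed.
    for size in (1_000, 100_000, 10_000_000):
        eff = link_efficiency(size, overhead_us=10.0, bandwidth_gb_s=1.0)
        print(f"{size:>10} bytes: {eff:.1%} of the time is data transfer")
```

Under these assumed numbers a 1 KB message spends roughly 9% of its time on the wire, so for small messages reducing the fixed overhead helps more than adding link bandwidth.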

Parallel Computing

• Parallel computing is (now) the path to speed

• People think the problem is solved but it’s not

• Need new benchmarks that expose true performance of COTS

• If the government is willing to invest early, even at the chip level, there is the potential to influence design in a way that makes scaling “commodity” systems easier

• Parallel computers to be much more general purpose than they are today– More useful, easier to use, and better balanced

– Continued growth of computing may depend on it

– To get significantly more performance, we must treat parallel computing as first class

– COTS processors especially will be influenced only by a generally applicable approach

Themes From White Papers

• Broad Themes– Exploit commodity

– One system doesn’t fit all applications; for a specific family of codes, commodity can be a good solution

– Unique topology and algorithmic approaches allow exploitation of current technology

• Novel uses of current technology (overlap with Panel 3)– RCM technology: FPGA faster, lower power with multiple units; hybrid FPGA-core is the traditional processor on chip with logic units; need H/W architect for RCM; apps suitable for RCM; RCM are about ease of programming

– Streaming technology utilizing commercial chips

– Fine-grained multithreading

• Supporting technology (overlap with Panel 1)– Self-managing, self-aware systems

– MRAM, EUVL, micro-channel

– Power-aware computing

– High-end interconnect and scalable file systems

– High-performance interconnect technology, optical and others, that can scale to large systems

– Systems software that scales up gracefully to enormous processor count with reliability, efficiency, and ease of

– There is a natural layering of technologies involved in a high-performance machine:

• the basic silicon, the cell boards and shared memory nodes, the cluster interconnect, the racks, the cooling, the OS kernel,

• the added OS services, the runtime libraries, the compilers and languages, the application libraries.

Relevant White Papers

18 of the 64/80 papers have some relevance to our topic: 6, 10, 12, 16, 17, 31, 33, 39, 45, 46, 47, 50, 65, 68, 72, 75, 80

Working Group 3: Custom-Based Architectures

Chair: Peter Kogge

Vice Chair: Thomas Sterling

WG3 – Architecture: Custom based Charter

• Charter– Identify opportunities and challenges for innovative HEC system architectures, including alternative execution models, support mechanisms, local element and system structures, and system engineering factors to accelerate rate of sustained performance gain (time to solution), performance to cost, programmability, and robustness. Establish a roadmap of advanced-concept alternative architectures likely to deliver dramatic improvements to user applications through the end of the decade. Specify those critical developments achievable through custom design necessary to realize their potential.

• Chair– Peter Kogge, Notre Dame

• Vice-Chair– Thomas Sterling, California Institute of Technology & Jet Propulsion Laboratory

WG3 – Architecture: Custom based Guidelines and Questions

• Present driver requirements and opportunities for innovative architectures demanding custom design

• Identify key research opportunities in advanced concepts for HEC architecture

• Determine research and development challenges to promising HEC architecture strategies. Project brief roadmap of potential developments and impact through the end of the decade.

• Specify impact and requirements of future architectures on system software and programming environments.

• Example topics:– System-on-a-chip (SOC), Processor-in-memory (PIM), streaming, vectors, multithreading, smart networks, execution models, efficiency factors, resource management, memory consistency, synchronization


• Duncan Buell, U. So. Carolina

• George Cotter, NSA

• William Dally, Stanford Un.

• James Davenport, BNL

• Jack Dennis, MIT

• Mootaz Elnozahy, IBM

• Bill Feiereisen, LANL

• Michael Henesey, SRC Computers

• David Fuller, JNIC

• David Kahaner, ATIP

• Peter Kogge, U. Notre Dame

• Norm Kreisman, DOE

• Grant Miller, NCO

• Jose Munoz, NNSA

• Steve Scott, Cray

• Vason Srini, UC Berkeley

• Thomas Sterling, Caltech/JPL

• Gus Uht, U. RI

• Keith Underwood, SNL

• John Wawrzynek, UC Berkeley

Charter (from Charge)

• Identify opportunities & challenges for innovative HEC system architectures, including

– alternative execution models,

– support mechanisms,

– local element and system structures,

– and system engineering factors

to accelerate

– rate of sustained performance gain (time to solution),

– performance to cost,

– programmability,

– and robustness.

• Establish roadmap of advanced-concept alternative architectures likely to deliver dramatic improvements to user applications through the end of the decade.

• Specify those critical developments achievable through custom design necessary to realize their potential.

Original Guidelines and Questions

• Present driver requirements and opportunities for innovative architectures demanding custom design

• Identify key research opportunities in advanced concepts for HEC architecture

• Determine research and development challenges to promising HEC architecture strategies.

• Project brief roadmap of potential developments and impact through the end of the decade.

• Specify impact and requirements of future architectures on system software and programming environments.

• (new) What role should/do universities play in developments in this area

Outline

• What is Custom Architecture (CA)

• Endgame Objectives, Benefits, & Challenges

• Fundamental Opportunities Delivered by CA

• Road Map

• Summary Findings

• Difficult fundamental challenges

• Roles of Universities

What Is Custom Architecture?

• Major components designed explicitly and system balanced for support of scalable, highly parallel HEC systems

• Exploits performance opportunities afforded by device technologies through innovative structures

• Addresses sources of performance degradation (inefficiencies) through specialty hardware and software mechanisms

• Enables higher HEC programming productivity through enhanced execution models

• Should incorporate COTS components where useful without sacrifice of performance

Endgame Objectives

• Enable solution of– Problems we can’t solve now

– And larger versions of ones we can solve now

• Base economic model: provides 10 – 100X ops/Lifecycle $ AT SCALE– Vs inefficiencies of COTS

• Significant reduction in real cost of programming– Focus on sustained performance, not peak

Strategic Benefits

• Promotes architecture diversity

• Performance: ops & bandwidth over COTS– Peak: 10X – 100X through FPU proliferation

– Memory bandwidth: 10X – 100X through network and signaling technology

– Focus on sustainable

• High efficiency– Dynamic latency hiding

– High system bandwidth and low latency

– Low overhead

• Enhanced programmability– Reduced barriers to performance tuning

– Enables use of programming models that simplify programming and eliminate sources of errors

• Scalability– Exploits parallelism at all levels

• Cost, size, and power– High compute density

Challenges To Custom

• Small market and limited opportunity to exploit economy of scale

• Development lead time

• Incompatibility with standard ISAs

• Difficulty of porting legacy codes

• Training of users in new execution models

• Unproven in the field

• Need to develop new software infrastructure

• Less frequent technology refresh

• Lack of vendor interest in leading edge small volumes

Fundamental Technical Opportunities Enabled by CA

• Enhanced Locality – Increasing Computation/Communication Demand

• Exceptional global bandwidth

• Architectures that enable utilization of global bandwidth

• Execution models that enable compiler/programmer to use the above

Enhanced Locality – Increasing Computation/Communication Demand

Mechanisms

• Spatial computation via reconfigurable logic

• Streams that capture physical locality by observing temporal locality

• Vectors – scalability and locality microarchitecture enhancements

• PIM – capture spatial locality via high bandwidth local memory (low latency)

• Deep and explicit register & memory hierarchies– With software management of hierarchies

Technologies

• Chip stacking to increase local B/W

Providing Exceptional Global Bandwidth

Mechanisms:

• High radix networks

• Non-blocking, bufferless topologies

• Hardware congestion control

• Compiler scheduled routing

Technologies:

• High speed signaling (system-oriented)– Optical, electrical, heterogeneous (e.g. VCSEL)

• Optical switching & routing

• High bandwidth memory device, high density

Notes:

• Routing & flow control are nearing optimal
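One reason high radix networks matter can be seen from a simplified count of switch stages. The sketch below assumes a multistage network in which each stage multiplies the number of reachable endpoints by the switch radix, so reaching N endpoints takes about ceil(log_radix N) stages. The function name and the node counts are illustrative assumptions, and real topologies differ in detail.

```python
# Simplified stage count for a multistage network built from radix-k switches.
# Assumption: each stage multiplies the number of reachable endpoints by k.


def stages_needed(num_nodes: int, radix: int) -> int:
    """Smallest number of stages n such that radix**n >= num_nodes."""
    if num_nodes < 1 or radix < 2:
        raise ValueError("need num_nodes >= 1 and radix >= 2")
    stages, reach = 0, 1
    while reach < num_nodes:  # integer arithmetic avoids float log rounding
        reach *= radix
        stages += 1
    return stages


if __name__ == "__main__":
    # Hypothetical system size of one million endpoints.
    for radix in (4, 16, 64):
        print(f"radix {radix:>2}: {stages_needed(1_000_000, radix)} stages")
```

Fewer stages means fewer switch traversals per message, which is the latency argument for raising the radix when per-port bandwidth allows it.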

Architectures that Enable Use of Global Bandwidth

Note: This addresses providing the traffic stream to utilize the enhanced network

• Stream and Vectors

• Multi-threading (SMT)

• Global shared memory (a communication overhead reducer)

• Low overhead message passing

• Augmenting microprocessors to enhance additional requests (T3E, Impulse)

• Prefetch mechanisms

Execution Models

Note: A good model should:

– Expose parallelism to compiler & system s/w

– Provide explicit performance cost model for key operations

– Not constrain ability to achieve high performance

– Provide ease of programming

• Spatial direct mapped hardware

• Resource flow

• Streams

• Flat vs Dist. Memory (UMA/NUMA vs M.P.)

• New memory semantics

• CAF and UPC, first good step

• Low overhead synchronization mechanisms

• PIM-enabled: Traveling threads, message-driven, active pages, ...

Roadmap: When to Expect CA Deployment

• 5 Years or less– Must have relatively mature support s/w (and/or “friendly users”)

• 5-10 years– Still open research issues in tools & system s/w

– Approaching 10 years if it requires a mindset change among application programmers

• 10-15 years:– After 2015 all that’s left in silicon is architecture

Roadmap - 5 Year Period

• Significant research prototype examples– Berkeley Emulation Engine: $0.4M/TF by 2004 on Immersed Boundary method codes

– QCDOC: $1M/TF by 2004

– Merrimac Streaming: $40K/TF by 2006

– Note: several companies are developing custom architecture roadmaps
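To put the quoted dollars-per-teraflop figures on a common footing, the sketch below scales each one to 1 PF (1,000 TF). Linear scaling of cost with performance is an assumption made here for illustration only; the slides do not claim it, and the figures refer to different years and workloads.

```python
# Scale the quoted $/TF figures to a petaflop, assuming cost grows linearly
# with performance (an illustrative assumption, not a claim from the slides).

COST_PER_TF_USD = {
    "Berkeley Emulation Engine (2004)": 400_000,
    "QCDOC (2004)": 1_000_000,
    "Merrimac Streaming (2006)": 40_000,
}

TF_PER_PF = 1_000


def cost_at_scale_usd(cost_per_tf_usd: int, target_tf: int = TF_PER_PF) -> int:
    """Projected cost in dollars for target_tf teraflops under linear scaling."""
    return cost_per_tf_usd * target_tf


if __name__ == "__main__":
    for name, per_tf in COST_PER_TF_USD.items():
        millions = cost_at_scale_usd(per_tf) / 1e6
        print(f"{name}: about ${millions:,.0f}M per PF")
```

Under that assumption the three prototypes span $40M to $1,000M per petaflop, a factor of 25 between the cheapest and the most expensive.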

Roadmap - 5 Years or Less: Technologies Ready for Insertion

• High bandwidth network technology can be inserted– No software changes

• SMT: will be ubiquitous within 5 years– But will vendors emphasize single thread performance in lieu of supporting increased parallelism?

• Spatial direct mapped approach

Roadmap - 5 to 10 Years

• All prior prototypes could be expanded to reach PF sustained at competitive recurring $

• Industry is targeting sustained Petaflops– If properly funded

• Need to encourage transfer of research results

• Virtually all of prior technology opportunities will be deployable– Drastic changes to programming will limit adoption

Roadmap: 10-15 Years

• Silicon scaling at sunset– Circuit, packaging, architecture, and software opportunities remain

• Need to start looking now at architectures that mesh with end of silicon roadmap and non-silicon technologies– Continue exponential scaling of performance

– Radically different timing/RAS considerations

– Spin out: how to use faulty silicon

Findings

• Significant CA-driven opportunities for enhanced Performance/Programmability– 10-100X potential above COTS at the same time

• Multiple CA-driven innovations identified for near & medium term– Near term: multiple proof of concept

– Medium term: deployment @ petaflops scale

• Above potential will not materialize in current funding culture

Findings (2)

• No one side of the community can realize opportunities of future Custom Architecture:– Strong peer-peer partnering needed between industry, national labs, & academia

– Restart pipeline of HEC & parallel-oriented grad students & faculty

• Creativity in system S/W & programming environments must support, track, & reflect creativity in HEC architecture

Findings (3)

• Need to start now preparing for end of Moore’s Law and transition into new technologies– If done right, potential for significant trickle back to silicon

Fundamentally Difficult Challenges: Technical

• Newer applications for HEC

• OS geared specifically to highly scaled systems

• How to design HEC for upgradability

• High latency, low bandwidth ratios of memory chips and systems

• File systems

• Reliability with unreliable components at large scale

• Fundamentally parallel ISAs

Fundamentally Difficult Challenges: Cultural

• Instilling change into programming model

• Software inertia

• How should HEC be viewed– As a service vs product

• I/O, SAN, Storage systems for HEC

• How to define requirements

Universities As A Critical Resource

• Provide innovative concepts and long term vision

• Provide students

• Keeps the research pipeline full

• Good at early simulations and prototype tools

• Students no longer commonly exposed to massive parallelism

• Parallel computing architecture students in significant decline, as well as those interested in HEC

• Difficult to roll leading edge chips but only place for 1st generation prototypes of novel concepts

• Don’t do well at attacking the hard problems of moving beyond 1st prototype, or productizing

• Soft money makes it hard to keep teams together

Working Group 4: Runtime and Operating Systems

Chair: Rick Stevens

Vice Chair: Ron Brightwell

WG 4– Runtime and OS Charter

• Charter– Establish baseline capabilities required in the operating systems for projected HEC systems scaled to the end of this decade and determine the critical advances that must be undertaken to meet these goals. Examine the potential, expanded role of low-level runtime system components in support of alternative system architectures.

• Chair– Rick Stevens, Argonne National Laboratory

• Vice-Chair– Ron Brightwell, Sandia National Laboratories

WG 4– Runtime and OS Guidelines and Questions

• Establish principal functional requirements of operating systems for HEC systems of the end of the decade

• Identify current limitations of OS software and determine initiatives required to address them

• Discuss role of open source software for HEC community needs and issues associated with development/maintenance/use of open source

• Examine future role of runtime system software in the management/use of HEC systems containing from thousands to millions of nodes.

• Example topics:– file systems, open source software, Linux, job and task scheduling, security, grid-interoperable, memory management, fault tolerance, checkpoint/restart, synchronization, runtime, I/O systems


• Ron Brightwell

• Neil Pundit

• Jeff Brown

• Lee Ward

• Gary Grider

• Ron Minnich

• Leslie Hart

• DK Panda

• Thuc Hoang

• Bob Ballance

• Barney McCabe

• Wes Felter

• Keshav Pingali

• Deborah Crawford

• Asaph Zemach

• Dan Reed

• Rick Stevens

Our Charge

• Establish principal functional requirements of OS/runtime for end-of-the-decade systems

• Assumptions:– Systems with 100K-1M nodes (fuzzy notion of node), +/- an order of magnitude (SMPs, etc.)

– COTS and custom targets included

• Role of Open Source in enabling progress

• Formulate critical recommendations on research objectives to address the requirements

Critical Topics

• Operating System and Runtime APIs

• High-Performance Hardware Abstraction

• Scalable Resource Management

• File Systems and Data Management

• Parallel I/O and External Networks

• Fault Management

• Configuration Management

• OS Portability and Development Productivity

• Programming Model Support

• Security

• OS and Systems Software Development Test beds

• Role of Open Source

Recurring Themes

• Limitations of UNIX

• Blending of OS and runtime models

• Coupling apps and OS via feedback mechanisms

• Performance transparency (visibility)

• Minimalism and enabling applications access to HW

• Desire for more hardware support for OS functions

• “Clusters” are the current OS/runtime targets

• Lack of full-scale test beds limiting progress

• OS “people” need to be involved in design decisions

OS APIs (e.g. POSIX)

• Findings:– POSIX APIs not adequate for future systems

• Lack of performance transparency

• Global state assumed in POSIX semantics

• Recommendations:– Determine a subset of POSIX APIs suitable for high-performance computing at scale

– New API development addressing scalability and performance transparency

• Explicitly support research in developing non-POSIX compatible

Hardware Abstractions

• Findings:– HAs needed for portability and improved resource management

• Remove dependence on physical configurations

– Virtual processor abstractions (e.g. MPI processes)

• Virtualization to improve resource management

– Virtual PIMs, etc. for improved programming model support

• Recommendations:– Research to determine what are the right candidates for virtualization

– Develop low overhead mechanisms for enabling abstraction

– Making abstraction layers visible and optional where needed

Scalable Resource Management

• Findings:– Resource allocation and scheduling at the system and node level are critical for large-scale HEC systems

– Memory hierarchy management will become increasingly important

– Dynamic process creation and dynamic resource management increasingly important

– Systems most likely to be space-shared

– OS support required for management of shared resources (network, I/O, etc.)

Scalable Resource Management

• Recommendations:– Investigate new models for resource management

• Enable user applications to take as much control of low-level resource management as needed

• Compute-Node model

– Minimal runtime; the app can bring as much or as little OS with it as it needs

• Systems/Services Nodes

– Need more OS services to manage shared resources

• I/O systems and fabric need to be managed

– Explore cooperative services model

• Some runtime and traditional OS combined into a cooperative scheme; offload services not considered critical for HEC

– Increase the potential use of dynamic resource management at all levels

Data Management and File Systems

• Findings:– The POSIX model for I/O is incompatible with future systems

– The passive file system model may also not be compatible with requirements for future systems

• Recommendations:– Develop an alternative (to POSIX) API for file system

– Investigate scalable authentication and authorization schemes for data

– Research scalable schemes for handling file systems (data management) metadata

– Consider moving processing into the I/O paths (storage devices)

Parallel and Network I/O

• Findings:– I/O channels will be highly parallel and shared (multiple users/jobs)

– External network and grid interconnects will be highly parallel

– The OS will need to manage I/O and network connections as a shared resource (even in space shared systems)

• Recommendations:– Develop new scalable approaches to supporting I/O and network interfaces (near term)

– Consider integrating I/O and network interface protocols (medium term)

– Develop HEC appropriate system interfaces to grid services (long term)

Fault Management

• Findings:– Fault management is increasingly critical for HEC systems

– The performance impacts of fault detection and management may be significant and unexpected

– Automatic fault recovery may not be appropriate in some cases

– Fault prediction will become increasingly critical

• Recommendations:– Efficient schemes for fault detection and prediction

• What can be done in hardware?

– Improved runtime handling (graceful degradation) of faults

– Investigate integration of fault management, diagnostics with advanced configuration management

– Autonomic computing ideas relevant here
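A back-of-the-envelope reliability model shows why fault management becomes critical as node counts grow. The sketch below assumes independent nodes with constant failure rates (exponentially distributed failures), so per-node failure rates simply add. The function names and the node MTBF of one million hours are hypothetical values chosen for illustration.

```python
import math

# Assumption: nodes fail independently with a constant failure rate, so the
# system failure rate is the sum of the per-node rates.


def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Mean time between failures anywhere in a system of identical nodes."""
    if node_mtbf_hours <= 0 or num_nodes < 1:
        raise ValueError("need node_mtbf_hours > 0 and num_nodes >= 1")
    return node_mtbf_hours / num_nodes


def prob_run_without_failure(job_hours: float, mtbf_hours: float) -> float:
    """Probability that a job of the given length sees no failure at all."""
    return math.exp(-job_hours / mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 1_000_000.0  # hypothetical per-node MTBF in hours
    for nodes in (1_000, 100_000, 1_000_000):
        mtbf = system_mtbf_hours(node_mtbf, nodes)
        p = prob_run_without_failure(24.0, mtbf)
        print(f"{nodes:>9} nodes: system MTBF {mtbf:g} h, "
              f"P(24 h job survives) = {p:.3g}")
```

Under these assumptions a 100,000-node system fails somewhere about every 10 hours even though each node is individually very reliable, which is why detection, prediction, and graceful degradation appear in the recommendations.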

Configuration Management

• Findings:– Scalability of management tools needs to be improved

• Manage to a provable state (database driven management)

– Support interrupted firmware/software update cycles (surviving partial updates)

– New models of configuration (away from file based systems) may be important directions for the future

• Recommendations:– New models for systems configuration needed

– Scalability research (scale invariance, abstractions)

– Develop interruptible update schemes (steal from database technologies)

– Fall back, fall forward

– Automatic local consistency
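One familiar way to survive a partial update is to borrow the write-then-commit idea from databases: write the new state to a temporary file, force it to disk, and switch it into place with a single atomic rename. The sketch below is a minimal single-node illustration under that assumption; `atomic_write_config` and the JSON file layout are hypothetical, not something the working group specified.

```python
import json
import os
import tempfile


def atomic_write_config(path: str, config: dict) -> None:
    """Replace the file at path with config, or leave the old file untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cfg-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(config, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the new contents reach disk
        os.replace(tmp_path, path)  # commit point: readers see old or new
    except BaseException:
        # Fall back: discard the partial file and keep the previous config.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

An update interrupted before the rename leaves the previous configuration in place (fall back), and one that reaches the rename is fully applied (fall forward). Coordinating such commits across many nodes is the harder, open part of the recommendation.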

OS Portability

• Findings:– Improving OS portability and OS/runtime code reuse will improve OS development productivity

• Device drivers (abstractions)

• Shared code base and modular software technology

• Recommendations:– Develop new requirements for device driver interfaces

• Support unification where possible and where performance permits

– Consider developing a common runtime execution software platform

– Research toward improving use of modularization and components in OS/runtime development

OS Security for HEC Systems

• Findings:– Current (nearly 30-year-old) Unix security model has significant limitations

– Multi-level security (Orange Book-like) may be a requirement for some HEC systems

– Current Unix security model is deeply coupled to current OS semantics and limits scalability in many cases

• Recommendations:– Active resource models

• Rootless, UIDless, etc.

– Eros, Plan 9 models possible starting point

– Fund research explicitly different from UNIX

Programming Model Support in OS

• Findings:– MPI has productivity limitations but is the current standard for portable programming; need to push beyond MPI

– UPC and CAF considered good candidates for improving productivity and probably should be targets for improved OS support

• Recommendations:– Determine OS level support needed for UPC and CAF and accelerate support for these (near term)

– Performance and productivity tool support (debuggers, performance tools, etc.)

Testbeds for Runtime and OS

• Findings:– Lack of full scale test beds has slowed research in scalable OS and systems software

– Test beds need to be configured to support aggressive testing and development

• Recommendations:– Establish one or more full scale (1,000’s nodes) test beds for runtime, OS and Systems software research communities

– Make test beds available to University, Laboratory and Commercial developers

The Role of Open Source

• Findings:– Open source model for licensing and sharing of software valuable for HEC OS and runtime development

– Open source (open community) development model may not be appropriate for HEC OS development

– The Open Source contract model may prove useful (LUSTRE model)

• Recommendations:– Encourage use of open source to increase leverage in OS development

– Consider creating and funding an Institute for HEC OS/runtime Open Source development and maintenance (keeping the HEC community in control of key software systems)

Working Group 5
Programming Environments and Tools

Chair: Dennis Gannon
Vice Chair: Rich Hirsh

WG5 – Programming Environments and Tools
Charter

• Charter
  – Address programming environments for both existing legacy codes and alternative programming models to maintain continuity of current practices, while also enabling advances in software development, debugging, performance tuning, maintenance, interoperability and robustness. Establish key strategies and initiatives required to improve time to solution and ensure the viability and sustainability of applying HEC systems by the end of the decade.
• Chair
  – Dennis Gannon, Indiana University
• Vice-Chair
  – Rich Hirsh, NSF

WG5 – Programming Environments and Tools
Guidelines and Questions

• Assume two possible paths to future programming environments:
  – incremental evolution of existing programming languages and tools consistent with portability of legacy codes
  – innovative programming models that dramatically advance user productivity and system efficiency/performance
• Specify requirements of programming environments and programmer training consistent with incremental evolution, including legacy applications
• Identify required attributes and opportunities of innovative programming methodologies for future HEC systems
• Determine key initiatives to improve productivity and reduce time-to-solution along both paths to future programming environments
• Example topics:
  – Programming models, portability, debugging, performance tuning, compilers


Key Findings

• Revitalizing evolutionary progress requires a dramatically increased investment in
  – Improving the quality/availability/usability of software development lifecycle tools
  – Building interoperable libraries and component/application frameworks that simplify the development of HEC applications
• Revitalizing basic research in revolutionary HEC programming technology to improve time-to-solution:
  – Higher-level programming models for HEC software developers that improve productivity
  – Research on the hardware/software boundary to improve HEC application performance

The Strategy

• Need an attitude change about software funding for HEC.
  – Software is a major cost component for all modern complex technologies.
• Mission-critical and basic research HEC software is not provided by industry
  – Need federally funded management and coordination of the development of high-end software tools.
  – Funding is needed for
    • Basic research and software prototypes
    • Technology transfer:
      – moving successful research prototypes into real production-quality software.
  – Structural changes are needed to support sustained engineering
    • Software capitalization program
    • Institute for HEC advanced software development and support.
      – Could be a cooperative effort between industry, labs, universities.

The Strategy

• A new approach to education for HEC is needed.
  – A national curriculum is needed for high performance computing.
  – Continuing education and building interdisciplinary science research.
  – A national HEC testbed for education and research

The State of the Art in HEC Programming

• Languages (used in legacy software)
  – A blend of traditional scientific programming languages, scripting languages plus parallel communication libraries and parallel extensions
    • (Fortran 66-95, C++, C, Python, Matlab) + MPI + OpenMP/threads, HPF
• Programming models in current use
  – Traditional serial programming
  – Global address space or partitioned memory space (MPI on Linux clusters)
  – SPMD vs MPMD

The Evolutionary Path Forward Already Exists

• For Languages
  – Co-array Fortran, UPC, Adaptive MPI, specialized C++ template libraries
• For Models
  – Automatic parallelization of whole-program serial legacy codes is no longer considered sufficient,
    • but it is important for code generation for procedure bodies on modern processors.
  – Multi-paradigm parallel programming is a desirable goal and within reach

Short Term Needs

• Progress in evolving HEC software practices to new languages and programming models is clearly very slow. The rest of the software industry is moving much faster.
  – What is the problem?
  – Scientists and engineers continue to use the old approaches because they are still perceived as the shortest path to the goal: a running code.
• In the short term, we need
  – A major initiative to improve the software design, debugging, testing and maintenance environment for HEC systems

The Components of a Solution

• Our applications are rapidly evolving to multi-language, multi-disciplinary, multi-paradigm software systems
  – High end computing has been shut out of a revolution in software tools
    • Even when tools have been available, centers can’t afford to buy them.
    • Scientific programmers are not trained in software engineering.
  – For example, industrial-quality build, configure and testing tools are not available for HEC applications/languages.
• We need portability of software maintenance tools across HEC platforms.

The Components of a Solution

• We need a rapid evolution of all language processing tools
  – Extensible standards are needed. Examples: language object file format, compiler intermediate forms.
  – Want complete interoperability of all software lifecycle tools.
• Performance analysis should be part of every step of the life cycle of a parallel program
  – Feedback from program execution can drive automatic analysis and optimization.

The Evolution of HEC Software Libraries

• The increasing complexity of scientific software (multi-disciplinary, multi-paradigm) has other side effects
  – Libraries are an essential way to encapsulate algorithmic complexity, but
    • Parallel libraries are often difficult to compose because of low-level conflicts over resources.
    • Libraries often require low-level flat interfaces. We need a mechanism to exchange more complex and interesting data structures.
• Software component technology and domain-specific application frameworks are one solution

Components and Application Frameworks

• Provides an approach to factoring legacy code into reusable components that can be flexibly composed.
  – Resource management is handled by the framework, and components encapsulate algorithmic functionality
• Provides for polymorphism and evolvability.
• Abstracts the hardware/software boundary and allows better language independence/interoperability.
• Testing/validating is made easier. Can ensure components can be trusted.
• May enable a marketplace of software libraries and components for HEC systems.
• However, no free lunch.
  – It may be faster to build a reliable application from reusable components, but will it have performance scalability?
    • Initial results indicate the answer is yes.

Revolutionary Approaches: The Long Range View

• We still have problems getting efficiency out of large-scale cluster architectures.
• A long-range program of research is needed to explore
  – New programming models for HEC systems
  – Scientific languages of the future:
    • Scientist does not think about concurrency but rather science.
    • Expressing concurrency as the natural parallelism in the problem.
      – Integrating a locality model into the problem can be the real challenge
    • Languages built from first principles to support the appropriate abstractions for scalable parallel scientific codes (e.g. ZPL).

New Abstractions for Parallel Program Models

• Approaches that promote automatic resource management.
• Integration of user-domain abstractions into compilation.
  – Extensible compilers
• Telescoping languages
  – Application-level languages transformed into high-level parallel languages transformed into …
• Locality may be part of new programming models.
• To be able to publish and discover algorithms.
• Automatic generation of missing components.
• Integrating persistence into the programming model.
• Better support for transactional interactions in applications

Programming Abstractions cont.

• Better separation of data structure and algorithms.
• Programming by contract
  – Quality of service, performance and correctness
• Integration of declarative and procedural programming
• Roundtrip engineering/model-driven software
  – Reengineering: specification to design and back
• Type systems that have better support for architectural properties.

Research on the Hardware/Software Boundary

• Instruction set architecture
  – Performance counters, interaction with VM and memory hierarchy
• Open bidirectional APIs between hardware and software
• Programming methodology for reconfigurable hardware will be a significant challenge.
• Changing memory consistency models depending on applications.

Research on the Hardware/Software Boundary

• Predictability (scheduling, fine-grained timing, memory mapping) is essential for scalable optimization.
• Fault tolerance/awareness
  – For systems with millions of processors, the applications, runtime, and OS will need to be aware of and have mechanisms to deal with faults.
  – Need mechanisms to identify and deal with faults at every level.
  – Develop programming models that better support non-determinism (including desirable but boundedly-incorrect results).

Hardware/Software Boundary: Memory Hierarchy

• There are limits to what we can do with legacy code that has bad memory locality problems.

• Software needs better control of data structure-to-hierarchy layout.

• New solutions:– Cache aware/cache oblivious algorithms

– Need more research on the role of virtual memory or file caching.

– Threads can be used to hide latency.

– New ways to think about data structures.
  • First-class support for hierarchical data structures.

– Streaming models

– Integration of persistence and aggressive use of temporal locality.

– Separation of algorithm and data structure, i.e. generic programming.

– Support from system software/hardware to control aspects of memory hierarchy.
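As an illustration of the "cache aware" item above, the following is a minimal sketch of loop tiling applied to a matrix transpose. It is not taken from the working group material: the function name `transpose_tiled` and the `tile` parameter are hypothetical, and pure Python will not show the cache benefit that the same loop structure can give in a compiled language. It only shows how the traversal order is restructured so that each tile of the input and output is touched together.

```python
def transpose_tiled(a, tile=32):
    """Transpose a non-empty list-of-lists matrix, visiting it tile by tile.

    `tile` is an assumed tuning parameter; a cache-aware code would pick it
    so that one input tile and one output tile fit in cache together.
    """
    n_rows, n_cols = len(a), len(a[0])
    out = [[None] * n_rows for _ in range(n_cols)]
    for i0 in range(0, n_rows, tile):          # tile origin (row)
        for j0 in range(0, n_cols, tile):      # tile origin (column)
            for i in range(i0, min(i0 + tile, n_rows)):
                row = a[i]
                for j in range(j0, min(j0 + tile, n_cols)):
                    out[j][i] = row[j]
    return out
```

A cache-oblivious variant would instead split the matrix recursively until the sub-blocks are small, so that no explicit `tile` size needs to be chosen.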

Best Practices and Education

• Education is crucial for the effective use of HEC systems.
  – Apps are more interdisciplinary
    • Requires interdisciplinary teams of people:
      – Drives need for better software engineering.
      – Application scientists do not need to be experts on parallel programming.
    • Multi-disciplinary teams including computer scientists.
  – Students need to be motivated to learn that performance is fun.
    • Updated curriculum to use HEC systems.
    • Educators/students need access to HEC systems
  – Need to increase support for student fellowships in HEC.

Working Group 6
Performance Modeling, Metrics and Specifications

Chair: David Bailey
Vice Chair: Allan Snavely

WG6 – Performance Modeling, Metrics, and Specifications
Charter

• Charter
  – Establish objectives of future performance metrics and measurement techniques to characterize system value and productivity to users and institutions. Identify strategies for evaluation including benchmarking of existing and proposed systems in support of user applications. Determine parameters for specification of system attributes and properties.
• Chair
  – David Bailey, Lawrence Berkeley National Laboratory
• Vice-Chair
  – Allan Snavely, UC San Diego

WG6 – Performance Modeling, Metrics, and Specifications
Guidelines and Questions

• As input to HECRTF charge (2c), provide information about the types of system design specifications needed to effectively meet various application domain requirements.
• Examine current state and value of performance modeling, metrics for HEC and recommend key extensions
• Analyze performance-based procurement specifications for HEC that lead to appropriately balanced systems.
• Recommend initiatives needed to overcome current limitations in this area.
• Example topics:
  – Metrics, time to solution, measurement and modeling methods, benchmarking, specification parameters, time to solution and relationship to fault-tolerance


• David Bailey
• Stan Ahalt
• Stephen Ashby
• Rupak Biswas
• Patrick Bohrer
• Carleton DeTar
• Jack Dongarra
• Ahmed Gameh
• Brent Gorda
• Adolfy Hoisie
• Sally McKee
• David Nelson
• Allan Snavely
• Jeffrey Vetter
• Theresa Windus
• Patrick Worley
• and others


Fundamental Metrics

Best single overriding metric: time to solution.

Time to solution includes:
• Execution time.
• Time spent in batch queues.
• System background interrupts and other overhead.
• Time lost due to scheduling inefficiencies, downtime.
• Programming time, debugging and tuning time.
• Pre-processing: grid generation, problem definition, etc.
• Post-processing: output data management, visualization, etc.

System balance depends on workload – there is no one formula for all applications.
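The time-to-solution breakdown above can be made concrete with a small bookkeeping sketch. This is an illustration only, not a metric defined by the working group: the class name `TimeToSolution`, its field names, and the choice of hours as the unit are all assumptions, and any numbers passed in are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass
class TimeToSolution:
    """Hours spent per phase; the fields mirror the list of components above."""
    execution: float = 0.0
    batch_queue: float = 0.0
    system_overhead: float = 0.0
    scheduling_and_downtime: float = 0.0
    programming_debug_tuning: float = 0.0
    pre_processing: float = 0.0
    post_processing: float = 0.0

    def total(self) -> float:
        # Time to solution is the sum of every component, not just execution.
        return sum(getattr(self, f.name) for f in fields(self))

    def share(self) -> dict:
        # Fraction of the total contributed by each component.
        t = self.total()
        if t <= 0:
            return {}
        return {f.name: getattr(self, f.name) / t for f in fields(self)}
```

Calling `share()` on such a record shows which component dominates for a given workload, which is the sense in which execution time alone can be a misleading figure.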

Related Factors

• Programming time and difficulty
  – Must be better understood (and reduced).
  – Identify key factors affecting development time.
  – Identify HPC-relevant techniques from software engineering.
  – Closely connected to research in programming models and languages.
• System-level efficiency
  – Some metrics exist (e.g., ESP).
• Performance stability
• Grid generation, problem definition, etc.
  – For some applications, this step requires more effort than the computational step.
  – No good metrics at present.

Current Best Practice for Procurements

• Characterize machines via micro-benchmarks and synthetic benchmarks run on available machines.
  – Numerous general specifications.
  – Results on some standard benchmarks.
  – Results on application benchmarks (different for each procurement).
• Identify and track applications of interest.
  – Use modeling to characterize performance.
  – Validate models on largest available system of that kind.
• Optimization problem: solving with constraints, including performance, dollars, floor space, power.
  – This step is not standardized, currently ad hoc.

This approach is inadequate to select systems 10x or more beyond systems in use at a given point in time.

Toward Performance-Based System Selection

• Procurements or other system selections should not be based on any single figure of merit.

• Can various agencies converge on a reference set of discipline-specific benchmark applications?
  – On a set of micro-benchmarks?
• How can we better handle intellectual property and classified code issues in procurements?

• Accurate performance modeling holds the best promise for simplifying procurement benchmarking.

Performance Modeling

• Goals: A set of low-level basic system metrics, plus a solid methodology for accurately projecting the performance of a specific high-level application program on a specific high-end system.
• Challenges:
  – Current approaches require significant skill and expertise.
  – Current approaches require large amounts of run time.
  – Fast, nearly automatic, easy-to-use schemes are needed.
• Benefits:
  – Architecture research
  – Procurements
  – Vendors
  – Users
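One very simple way to picture "low-level system metrics plus a projection methodology" is to combine per-application operation counts with per-machine sustained rates. The sketch below is an assumption-laden toy, not the methodology the working group has in mind: the function names, the dictionary keys (`flops`, `mem_bytes`), and the two bounding assumptions (no overlap versus perfect overlap between resource classes) are all hypothetical.

```python
def predict_runtime_no_overlap(app_signature, machine_profile):
    """Upper-bound estimate: each resource class is used one after another.

    app_signature:   operation counts, e.g. {"flops": ..., "mem_bytes": ...}
    machine_profile: sustained rates per second for the same keys.
    """
    return sum(count / machine_profile[op] for op, count in app_signature.items())


def predict_runtime_full_overlap(app_signature, machine_profile):
    """Lower-bound estimate: all resource classes overlap perfectly,
    so the slowest one alone determines the runtime."""
    return max(count / machine_profile[op] for op, count in app_signature.items())
```

The gap between the two bounds gives a rough sense of how much the answer depends on overlap, which is one reason accurate models currently require significant skill to construct.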

Potential Modeling Impact

• Influence architecture early in the design cycle
• Improve applications development
  – Use modeling in the entire lifecycle of an application, including algorithmic selection, code development, software engineering, deployment, tuning.
• Impact assessment
  – Project new science enabled by a proposed petaflop system.
• Research needed in:
  – Novel approaches to performance modeling: analytical, statistical, kernels and benchmarks, synthetic programs.
  – How to deal with exploding quantity of performance data on systems with 10,000+ CPUs.
  – Online reduction of trace data.
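To make "online reduction of trace data" concrete, here is a minimal sketch of one possible reduction: keeping running summary statistics for a stream of event durations instead of storing every trace record. This uses Welford's online algorithm for mean and variance; the class name `RunningStats` and the choice of which statistics to keep are assumptions made for illustration, not a technique proposed in the slides.

```python
import math


class RunningStats:
    """Constant-memory summary of a stream of measurements (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0          # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def variance(self) -> float:
        # Sample variance; defined as 0.0 here when fewer than two samples exist.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

One such object per event type and per process keeps the retained data size independent of run length, at the cost of discarding the ordering information a full trace would preserve.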

System Simulation

Salishan Conference, Apr. 2003: “Computational scientists have become quite expert in using high-end computers to model everything except the systems they run on.”

Research in the parallel discrete event simulation (PDES) field now makes it possible to:

• Develop a modular open-source system simulation facility, to be used by researchers and vendors.
  – Prime application: modeling very large-scale inter-processor networks.

• Need to work with vendors to resolve potential intellectual property issues.

Tools and Standards

• Characterized workloads from different agencies
  – Establishing a common set of low-level micro-benchmarks predictive of performance.
  – In-depth characterization of applications incorporated in a common performance modeling framework.
  – Enables comparability of models and cooperative sharing of workload requirements.
• A standardized simulation framework for modeling and predicting performance of future machines.
• Diagnostic tools to reveal factors affecting performance on existing machines.

• Intelligent, visualization-based facilities to locate “hot spots” and other performance anomalies.

Performance Tuning

• Self-tuning library software: FFTW, ATLAS, LAPACK.
• Near-term (1-5 yrs):
  – Extend to numerous other scientific libraries.
• Mid-term (5-10 yrs):
  – Develop prototype pre-processor tools that can extend this technology to ordinary user-written codes.
• Long-term (10-15 yrs):
  – Incorporate this technology into compilers.

Example from history: vectorization
  – Step 1: Completely manual, explicit vectorization
  – Step 2: Semi-automatic vectorization, using directives
  – Step 3: Generate both scalar and vector code, selected with run-time analysis
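The core idea behind self-tuning libraries can be sketched as empirical selection: time several functionally equivalent variants on a representative input and keep the fastest. The following is a toy illustration under that assumption, not how FFTW or ATLAS are actually implemented; `autotune`, `sum_builtin`, and `sum_loop` are hypothetical names, and real libraries search far larger variant spaces.

```python
import timeit


def sum_builtin(xs):
    """Variant 1: rely on the built-in reduction."""
    return sum(xs)


def sum_loop(xs):
    """Variant 2: explicit accumulation loop."""
    total = 0
    for x in xs:
        total += x
    return total


def autotune(candidates, sample_input, repeats=3, number=5):
    """Time each candidate on sample_input and return (best_name, timings).

    The minimum over `repeats` runs is used to reduce noise from other
    activity on the machine.
    """
    timings = {}
    for name, fn in candidates.items():
        runs = timeit.repeat(lambda: fn(sample_input), repeat=repeats, number=number)
        timings[name] = min(runs)
    best = min(timings, key=timings.get)
    return best, timings
```

The selected variant depends on the machine and the input, which is the point: the choice is measured at install or run time rather than fixed by the programmer.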

Working Group 7
Application-Driven System Requirements

Chair: Mike Norman
Vice Chair: John Van Rosendale

WG7 – Application-driven System Requirements
Charter

• Charter
  – Identify major classes of applications likely to dominate HEC system usage by the end of the decade. Determine machine properties (floating point performance, memory, interconnect performance, I/O capability and mass storage capacity) needed to enable major progress in each of the classes of applications. Discuss the impact of system architecture on applications. Determine the software tools needed to enable application development and support for execution. Consider the user support attributes including ease of use required to enable effective use of HEC systems.
• Chair
  – Mike Norman, University of California at San Diego
• Vice-Chair
  – John Van Rosendale, DOE

WG7 – Application-driven System Requirements
Guidelines and Questions

• Identify major classes of applications likely to dominate use of HEC systems in the coming decade, and determine the scale of resources needed to make important progress. For each class indicate the major hardware, software and algorithmic challenges.
• Determine the range of critical systems parameters needed to make major progress on the applications that have been identified. Indicate the extent to which system architecture affects productivity for these applications.
• Identify key user environment requirements, including code development and performance analysis tools, staff support, mass storage facilities, and networks.
• Example topics:
  – applications, algorithms, hardware and software requirements, user support

Discipline Coverage

• Lattice Gauge Theory
• Accelerator Physics
• Magnetic Fusion
• Chemistry and Environmental Cleanup
• Bio-molecules and Bio-Systems
• Materials Science and Nanoscience
• Astrophysics and Cosmology
• Earth Sciences
• Aviation

FINDING #1

Top Challenges

• Achieving high sustained performance on complex applications is becoming more and more difficult

• Building and maintaining complex applications

• Managing data tsunami (input and output)

• Integrating multi-scale space and time, multi-disciplinary simulations

Multi-Scale Simulation in Nanoscience

Maciej Gutowski, WP 001

Question 1

• Identify major classes of applications likely to dominate use of HEC systems in the coming decade, and determine the scale of resources needed to make important progress. For each class indicate the major hardware, software and algorithmic challenges.


Question 2

• Determine the range of critical systems parameters needed to make major progress on the applications that have been identified. Indicate the extent to which system architecture affects productivity for these applications.

Question 3

• Identify key user environment requirements, including code development and performance analysis tools, staff support, mass storage facilities, and networks.

Findings: HW [1]

• 100x current sustained performance needed now in many disciplines to reach concrete objectives
• A spectrum of architectures is needed to meet varying application requirements
  – Customizable COTS an emerging reality
  – Closer coupling of application developers with computer designers needed
• The time dimension is sequential and difficult to parallelize: ultrafast processors and new algorithms are required.
  – fusion, climate simulation, biomolecular, astrophysics: multiscale problems in general

Findings: HW [2]

• Thousands of CPUs useful with present codes and algorithms; reservations about 10,000 (scalability and reliability)
  – Some applications can effectively exploit 1000s of CPUs only by allowing problem size to grow (weak scaling)
• Memory bandwidth and latency seem to be a universal issue
• Communication fabric latency/bandwidth is a critical issue: applications vary greatly in their communications needs
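The weak-scaling remark above can be illustrated with the two textbook scaling laws. This is a sketch for intuition only: the function names are hypothetical, the 1% serial fraction used in the usage note is a placeholder rather than a measured value, and the "serial fraction" is defined relative to the serial runtime in Amdahl's law but relative to the parallel runtime in Gustafson's law, so the two inputs are not strictly the same quantity.

```python
def amdahl_speedup(serial_fraction: float, p: int) -> float:
    """Strong scaling: fixed problem size spread over p processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def gustafson_speedup(serial_fraction: float, p: int) -> float:
    """Weak scaling: problem size grows with p, fixed time per run."""
    return serial_fraction + (1.0 - serial_fraction) * p
```

With an assumed serial fraction of 0.01 and 1,000 processors, the fixed-size speedup is bounded near 91 while the scaled-size speedup is about 990, which is why some codes can use thousands of CPUs only when the problem is allowed to grow.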

Findings: Software

• SW model of single-programmer monolithic codes is running out of steam; need to switch to a team-based approach (à la SciDAC)
  – scientists, application developers, applied mathematicians, computer scientists
  – modern SW practices for rapid response
• Multi-scale and/or multi-disciplinary integration is a social as well as a technical challenge
  – new team structures and new mechanisms to support collaboration are needed
  – intellectual effort is distributed, not centralized

Findings: User Environment

• Emerging data management challenge in all sciences; e.g., bio-sciences
• Massive shared memory architectures for data analysis/assimilation/mining
  – TBs/day (NCAR/GFDL, NERSC, DOE Genome to Life, HEP)
  – sequential ingest/analysis codes
  – I/O-centric architectures
• HEC visualization environments à la DOE Data Corridors

Strategy and Policy [1]

• HEC has become essential to the advancement of many fields of science & engineering

• US scientific leadership in jeopardy without increased and balanced investment in HEC hardware and wetware (i.e., people)

• 100x increase of current sustained performance needed now to maintain scientific leadership

Strategy and Policy [2]

• A spectrum of architectures is needed to meet varying application requirements

• New institutional structures needed for disciplinary computational science teams (research facility model)
  – An integrated answer to Question 3

Facilities Analogy

[Figure: diagram comparing a national user facility with HPC facilities. Labels recoverable from the figure: National User Facility: Spallation Neutron Source (SNS), with user-interface “end stations” (neutron reflectometer, ultra-high vacuum station, sample, high-res. triple axis, small angle scattering). HPC facilities: NERSC, ORNL-CCS, PSC. Domain-specific research networks (Materials Science Research Network, Fusion, QCD) with collaborative research teams (Magnetism CRT, Correlation CRT, Microstructure CRT, Fusion CRT 1, QCD CRT 1). Network elements: standards-based tool kits, open source repository, workshops, education; materials scientists, mathematicians and computer scientists. Other labels: nano-magnetism, strongly correlated materials, polymer science, dynamics, users, direction of competition.]

Working Group 8
Procurement, Accessibility and Cost of Ownership

Chair: Frank Thames
Vice-Chair: Jim Kasdorf

WG8 – Procurement, Accessibility, and Cost of Ownership
Charter

• Charter
  – Explore the principal factors affecting acquisition and operation of HEC systems through the end of this decade. Identify those improvements required in procurement methods and means of user allocation and access. Determine the major factors contributing to the cost of ownership of the HEC system over its lifetime. Identify impact of procurement strategy including benchmarks on sustained availability of systems.
• Chair
  – Frank Thames, NASA
• Vice-Chair
  – Jim Kasdorf, Pittsburgh Supercomputing Center

WG8 – Procurement, Accessibility, and Cost of Ownership
Guidelines and Questions

• Evaluate the implications of the virtuous infrastructure cycle, i.e., the relationship among advanced procurement, development, and deployment in shaping research, development, and procurement of HEC systems.
• As input to HECRTF charge (3c), provide information about total cost of ownership beyond procurement cost, including space, maintenance, utilities, upgradeability, etc.
• As input to HECRTF charge (3) overall, provide information about how the Federal government can improve the processes of procuring and providing access to HEC systems and tools
• Example topics:
  – procurement, requirements specification, user infrastructure, remote access, allocation policies, security, power and cooling costs, maintenance costs, reliability and support

Working Group Participants

• Frank Thames
• Jim Kasdorf
• Bill Turnbull
• Gary Wohl
• Candace Culhane
• James Tomkins
• Charles W. Hayes
• Sander Lee
• Charles Slocomb
• Christopher Jehn
• Matt Leininger
• Mark Seager
• Gary Walter
• Graciela Narcho
• Dale Spangenberg
• Thomas Zacharia
• Gene Bal
• Per Nyberg
• Scott Studham
• Rene Copeland
• Paul Muzio
• Phil Webster
• Steve Perry
• Cray Henry
• Tom Page

WG8 Paper Presentations

• Per Nyberg: Total Cost of Ownership
• Matt Leininger: A Capacity First Strategy to U.S. HEC
• Steve Perry: Improving the Process of Procuring HEC Systems
• Scott Studham: Best Practices for the Procurement of High Performance Computers by the Federal Government

Total Cost of Ownership

• Procurement of Capital Assets
  – Hardware
  – Acquisition cost (FTE)
  – Cost of money for LTOPS
  – Software licenses
• Maintenance of Capital Assets
• Services (workforce dominated; will inflate yearly)
  – Application support/porting
  – System administration
  – Operations
  – Security

Total Cost of Ownership

• Facility
  – Site preparation
  – HVAC
  – Electrical power
  – Maintenance
  – Initial construction
  – Floor space
• Networks: local and WAN
• Training
• Miscellaneous
  – Residual value of equipment
  – Disposal of assets
  – Insurance

Total Cost of Ownership

• Can “Lost Opportunity” cost be quantified?
  – Lost research opportunities
  – Lower productivity due to lack of tools
  – Codes not optimized for the architecture
  – Etc.
• Replacement cost of human resources
• Difficulty in valuing system software as it impacts productivity (development and production), versus the quantitative methods available to measure hardware performance

Total Cost of Ownership

• Other Considerations
  – If costs are to include end-to-end services
    • Output analysis must be added (e.g., visualization)
    • Mass storage
    • Application development
      – Some architectures are harder to program (ASCI: 4-6 years application development; application lifetime: 10-20 years)
      – H/W architectures last 3-4 years; applications must last over multiple architectures

Total Cost of Ownership – Bottom Line

• Consider ALL applicable factors

• Some are not obvious

• Develop a comprehensive cost candidate list
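A comprehensive cost candidate list can be turned into a simple lifetime estimate by separating one-time costs from recurring ones and letting workforce-dominated items inflate yearly, as noted above. The sketch below is illustrative only: the function name, the category names in the usage note, and every number are placeholders, not figures from the working group.

```python
def total_cost_of_ownership(one_time, annual, years, annual_inflation=None):
    """Sum one-time costs plus recurring costs over the system lifetime.

    one_time:         {category: cost} paid once (e.g. hardware, site preparation)
    annual:           {category: first-year cost} paid every year
    years:            number of years the system is operated
    annual_inflation: optional {category: yearly growth rate}, e.g. for services
    """
    annual_inflation = annual_inflation or {}
    total = sum(one_time.values())
    for name, first_year_cost in annual.items():
        rate = annual_inflation.get(name, 0.0)
        # Year y (starting at 0) costs first_year_cost * (1 + rate) ** y.
        total += sum(first_year_cost * (1.0 + rate) ** y for y in range(years))
    return total
```

For example, with placeholder inputs `one_time={"hardware": 100}`, `annual={"services": 10, "power": 5}`, `years=2`, and `annual_inflation={"services": 0.1}`, the total is 100 + (10 + 11) + (5 + 5) = 131 in whatever unit the inputs use. Harder-to-quantify items such as lost opportunity cost would still need to be tracked separately.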

Procurement

• Requirements Specification

• Evaluation Criteria

• Improving the Process

• Contract Type

• Other Considerations

Procurement

• Requirements Specification

– Elucidate the fundamental science requirement

– Emphasize quantifiable Functional requirements

– Exploit economies of scale

– Application development environment

– Make optimum use of contract options and modifications

– Maximize the use of technical partnerships

– Consider flexible delivery dates where applicable (increases vendor flexibility)

Procurement (Continued)

• Requirements Specification (Continued)

– Be careful about “mandatory” requirements; prioritize or weight them

– Be aware of specifications which may limit competition

– Avoid “over specifying” requirements for advanced systems

– Fundamental differences in specification depending on the intended use of the system (Natural tension between Capacity vs Capability and general tool vs specific research tool)

Procurement (Continued)

• Evaluation Criteria
  – For options on long-term contracts, projected “speedup” of applications
  – Total Cost of Ownership
  – Use “Real Benchmarks”
    • Be careful not to water down benchmarks too much
    • On the other hand, don’t push so hard that some vendors can’t afford it
    • Other approaches needed for future advanced systems
  – Use Best Value
  – Risks

Procurement (Continued)

• Improving the Process
  – Ensure users are heavily involved in the process
    • Eases vendor risk mitigation
    • Users have “decision proximity”
  – Non-disclosures required by vendors hamstring government personnel after award
  – Maintain communications between vendors and customers during the acquisition cycle without compromising fairness


• Improving the Process (Continued)
  – Consider DARPA HPCS Process for “Advanced Systems”
    • Multiple down-selects
    • R&D like
    • Leads to production system at end
  – Attempt to maintain acquisition schedule adherence


• Contract Type
  – Consider Cost Plus contracts for new technology systems or those with inherent risks (e.g., development contracts)
  – Leverage existing contracts that fit what you want to do
• Other Considerations
  – Don’t have a single acquisition for ALL HEC in government
    • Leads to “Ivory Tower” syndrome and
    • A disconnect from users
    • Bottom line: don’t over-centralize


• Other Considerations (Continued)
  – Inconsistencies in the way acquisition regulations are implemented can lead to inefficiencies (vendor issue)
  – Practices that would revitalize the HEC industry
    • What size of market is needed: at least several hundred million dollars per year per vendor
    • Recognize that HEC vendors must make an acceptable return to survive and invest

Accessibility

• Key Issue: Funding in the requiring Agency to purchase computational capabilities from other sources

• There are many valid vehicles for interagency agreements that provide accessibility (e.g., interagency MOUs)

• Suggested Process: DOE Office of Science and NSF process – open scientific merit evaluated on a project by project basis

• Current large sources would add x% capability to supply computational capabilities to smaller agencies

• Implementation suggestion: Consider providing a single POC for Agencies for HEC access (NCO?)
