TRANSCRIPT
Recent HPC Research Trends and Strategy in the United States
Professor William Kramer
National Center for Supercomputing Applications, University of Illinois http://bluewaters.ncsa.illinois.edu
Extreme Scale Motivations
• Science and research drivers require more investment in computing, but is more always better?
  • More investment
  • Better/bigger systems
  • More efficient systems
• National security and leadership in HW and SW innovations
• National competitiveness for science, research, and industry
• Engagement/NRE funding for technology vendors
Impacts of Extreme Scale Computing - Oct 2017
• National
  • "Whole of government" approach
  • Public/private partnership with industry and academia
• Strategic
  • Leverage beyond individual programs
  • Long time horizon (decade or more)
• Computing
  • HPC as advanced, capable computing technology
  • Multiple styles of computing and all necessary infrastructure
  • Scope includes everything necessary for a fully integrated capability
• Initiative
  • Above-baseline effort
  • Link and lift efforts
National Strategic Computing Initiative Executive Order Signed July 29, 2015
Enhance U.S. strategic advantage in HPC for security, economic competitiveness, and scientific discovery
https://www.nitrd.gov/nsci/index.aspx
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201512/Szulman_ASCAC_Briefing_120915.pdf
NSCI Objectives
1. Accelerate delivery of a capable exascale computing system (hardware and software) to deliver approximately 100X the performance of current 10 PF systems across a range of applications reflecting government needs
2. Increase coherence between the technology base used for modeling and simulation and that used for data-analytic computing
3. Establish, over the next 15 years, a viable path forward for future HPC systems in the post-Moore's-Law …
4. Increase the capacity and capability of an enduring national HPC ecosystem, employing a holistic approach … networking, workflow, downward scaling, foundational algorithms and software, and workforce development
5. Develop an enduring public-private partnership to assure that the benefits … are transferred to the U.S. commercial, government, and academic sectors
The Government's Co-Leader Roles in NSCI
• DOE
  • Capable exascale program
• NSF
  • Scientific discovery
  • Broader HPC ecosystem
  • Workforce development
• DOD
  • Analytic computing to support missions: science and national security
• IARPA + NIST
  • Future computing technologies
• NASA, FBI, NIH, DHS, NOAA
  • Deployment within their mission contexts
National AI Initiative
• Additional national initiative (Oct 2016)
• Synergistic with some aspects of NSCI, but not linked
• Implementation and other details are to come
National Science Foundation Initiatives
• Strong leadership in the National Strategic Computing Initiative
• Continued advocacy for sustained, productive technology, which may have helped focus other programs
• National Academy report
• Plans for provisioning new resources
• Plans for Computer Science and Computer Engineering research
  • NSF provides >80% of funding for CS research
• Software centers
• Data-focused software and infrastructure
• NSF supports a large amount of computing for the National Institutes of Health
National Academy Report for NSF Future Leadership Computing Infrastructure
• Recommendations
1. Sustain and seek to grow its investments in advanced computing, to include hardware and services, software and algorithms, and expertise
2. Provide support for the revolution in data-driven science along with simulation
3. Collect community requirements and construct and publish roadmaps
4. Allow investments … to be considered in an integrated manner
5. Support the development and maintenance of expertise, scientific software, and software tools … to make efficient use of its advanced computing resources
6. Invest modestly to explore next-generation hardware and software technologies
7. Manage advanced computing investments in a more predictable and sustainable way
Slide Courtesy of Irene Qualters-NSF
[Figure: NSF Aspirations for Convergence. A diagram plotting data intensity against computational intensity, spanning "Big Data" data analytics, high-performance modeling and simulation, and large-scale data-driven modeling and simulation.]
Information Courtesy of Irene Qualters-NSF
2. Increase coherence between the technology base used for modeling and simulation and that used for data-analytic computing
• Modeling and Simulation: multi-scale, multi-physics, multi-resolution, multidisciplinary, coupled models
• Data Science: data assimilation, visualization, image analysis, data compression, data analytics
• NSF role: support foundational research and research infrastructure within and across all disciplines (across all NSF directorates)
Slide Courtesy of Irene Qualters-NSF
NSF role in NSCI: Enduring Computational Ecosystem for Advancing Science and Engineering
• Fundamental research in HPC platform technologies, architectures, and approaches
• Infrastructure platform pilots, development, and deployment
• Computational and data literacy across all STEM disciplines
• Computational- and data-enabled science and engineering discovery
Information Courtesy of Irene Qualters-NSF
What does DOE mean by Capable Exascale?
• Paul Messina, Director of the Exascale Computing Project, offered this working definition of "capable exascale" in a presentation:
• A capable exascale system is defined as a supercomputer that can solve science problems 50X faster (or more complex) than the 20 PF systems of today (Titan, Sequoia), in a power envelope of 20-30 MW, and that is sufficiently resilient that user intervention due to hardware or system faults is required on the order of once a week on average.
• And it has a software stack that meets the needs of a broad spectrum of applications and workloads.
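As a back-of-the-envelope check on this definition, the stated 50X-over-20 PF target and 20-30 MW envelope together imply a roughly 1 EF machine that must reach tens of gigaflops per watt. A minimal sketch, using only the numbers quoted above (the variable names and the arithmetic framing are mine, not from the slides):

```python
# Back-of-the-envelope arithmetic for the "capable exascale" definition.
# Baseline and targets come from the stated definition above.

baseline_pf = 20.0                  # Titan/Sequoia-class baseline, petaflops
speedup = 50.0                      # "50X faster (or more complex)"
power_budget_mw = (20.0, 30.0)      # stated power envelope, megawatts

target_pf = baseline_pf * speedup   # 1000 PF, i.e. 1 exaflop
target_ef = target_pf / 1000.0

# Required energy efficiency in gigaflops per watt at each end of the envelope
flops = target_pf * 1e15
efficiency_gf_per_w = [flops / (mw * 1e6) / 1e9 for mw in power_budget_mw]

print(f"Target: {target_ef:.1f} EF")
print(f"Required efficiency: {efficiency_gf_per_w[1]:.1f}-{efficiency_gf_per_w[0]:.1f} GF/W")
# → Target: 1.0 EF
# → Required efficiency: 33.3-50.0 GF/W
```

The interesting implication is the efficiency number: hitting 1 EF inside 20-30 MW requires roughly 33-50 GF/W, which is why the power-reduction goals elsewhere in the project are so prominent.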
DOE Exascale Computing Project Goals
• Foster application development: develop scientific, engineering, and large-data applications that exploit the emerging, exascale-era computational trends caused by the end of Dennard scaling and Moore's Law
• Ease of use: create software that makes exascale systems usable by a wide variety of scientists and engineers across a range of applications
• ≥ Two diverse architectures: enable by 2023 at least two diverse computing platforms with up to 50× more computational capability than today's 20 PF systems, within a similar size, cost, and power footprint
• US HPC leadership: help ensure continued American leadership in architecture, software, and applications to support scientific discovery, energy assurance, stockpile stewardship, and nonproliferation programs and policies
Slide Courtesy of Paul Messina-ANL
ECP is a holistic approach that uses co-design and integration across four areas:
• Application Development: science and mission applications
• Software Technology: scalable and productive software stack
• Hardware Technology: hardware technology elements
• Exascale Systems: integrated exascale supercomputers
The software stack spans: applications co-design; correctness, visualization, and data analysis; programming models, development environment, and runtimes; tools; math libraries and frameworks; system software (resource management, threading, scheduling, monitoring, and control); memory and burst buffer; data management; I/O and file system; node OS and runtimes; resilience; workflows; and the hardware interface.
ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development.
Slide Courtesy of Paul Messina-ANL
ECP Mission Need Defines the Application Strategy
Key science and technology challenges to be addressed with exascale:
• Materials discovery and design
• Climate science
• Nuclear energy
• Combustion science
• Large-data applications
• Fusion energy
• National security
• Additive manufacturing
• Many others!
Meet national security needs:
• Stockpile Stewardship Annual Assessment and Significant Finding Investigations
• Robust uncertainty quantification (UQ) techniques in support of lifetime extension programs
• Understanding evolving nuclear threats posed by adversaries and developing policies to mitigate these threats
Support DOE science and energy missions:
• Discover and characterize next-generation materials
• Systematically understand and improve chemical processes
• Analyze the extremely large datasets resulting from the next generation of particle-physics experiments
• Extract knowledge from systems-biology studies of the microbiome
• Advance applied energy technologies (e.g., whole-device models of plasma-based fusion systems)
Slide Courtesy of Paul Messina-ANL
Recent ST Selections Mapped to Software Stack
• Correctness
• Visualization: VTK-m, ALPINE (ParaView, VisIt)
• Data Analysis: ALPINE
• Applications Co-Design
• Programming Models, Development Environment, and Runtimes: MPI (MPICH, Open MPI), OpenMP, OpenACC, PGAS (UPC++, Global Arrays), task-based (PaRSEC, Legion), RAJA, Kokkos, runtime library for power steering
• System Software, Resource Management, Threading, Scheduling, Monitoring, and Control: Qthreads, Argobots, global resource management
• Tools: PAPI, HPCToolkit, Darshan (I/O), performance portability (ROSE, autotuning, PROTEAS, OpenMP), compilers (LLVM, Flang)
• Math Libraries/Frameworks: ScaLAPACK, DPLASMA, MAGMA, PETSc/TAO, Trilinos Fortran, xSDK, PEEKS, SuperLU, STRUMPACK, SUNDIALS, DTK, TASMANIAN, AMP
• Memory and Burst Buffer: checkpoint/restart (UNIFYCR), API and library for complex memory hierarchy
• Data Management, I/O and File System: ExaHDF5, PnetCDF, ROMIO, ADIOS, checkpoint/restart (VeloC), compression, I/O services
• Node OS, Low-Level Runtimes: Argo OS enhancements
• Resilience: checkpoint/restart (VeloC, UNIFYCR)
• Workflows
• Hardware interface
Slide Courtesy of Paul Messina-ANL
Develop the technology needed to build and support the exascale systems
• Mission need: the Exascale Computing Project requires Hardware Technology R&D to enhance application and system performance for science, engineering, and data-analytics applications on exascale systems
• Objective: support hardware architecture R&D at both the node and system architecture levels; prioritize R&D activities that address ECP performance objectives for the initial Exascale System RFPs
• Hardware thrust scope: enable Application Development, Software Technology, and Exascale Systems to improve the performance and usability of future HPC hardware platforms (holistic co-design)
Slide Courtesy of Paul Messina-ANL
Capable exascale computing requires close coupling and coordination of key development and technology R&D areas:
• Application Development
• Software Technology
• Hardware Technology
• Exascale Systems
Integration and co-design across all four ECP areas is key.
Slide Courtesy of Paul Messina-ANL
DOE capable exascale systems by 2021-2023
• "Non-recurring engineering" (NRE) activities will be integral to next-generation computing hardware and software.
• Four key challenges will be addressed through targeted R&D investments to bridge the capability gap: energy consumption, reliability, memory and storage, and parallelism.
• Systems must meet ECP's essential performance parameters:
  • 50 times the current performance
  • 10 times reduction in power consumption
  • System resilience: 6 days without application failure
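To put the resilience parameter in perspective, a system-level mean time between application-visible failures of 6 days implies extremely reliable individual components. A rough sketch, where the node count and the series-failure model are my illustrative assumptions, not figures from the slides:

```python
# Rough reliability arithmetic for the "6 days without app failure" target.
# Assumes node failures are independent and any single node failure kills
# the application (a worst-case, checkpoint-free model); the node count
# is illustrative, not from the slides.

system_mtbf_hours = 6 * 24          # target: 6 days between app failures
nodes = 50_000                      # illustrative exascale-class node count

# In a series system of independent components, failure rates add, so the
# required per-node MTBF is the system MTBF multiplied by the node count.
per_node_mtbf_hours = system_mtbf_hours * nodes
per_node_mtbf_years = per_node_mtbf_hours / (24 * 365)

print(f"Required per-node MTBF: ~{per_node_mtbf_years:.0f} years")
# → Required per-node MTBF: ~822 years
```

Centuries of per-node MTBF is not achievable with commodity parts alone, which is why the resilience line of the software stack (checkpoint/restart via VeloC and UNIFYCR, mentioned above) matters: it relaxes the requirement that every fault be survived in hardware.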
Slide Courtesy of Paul Messina-ANL
Planned US Leadership Systems
• 2019
  • NSF Leadership Class Computing Facility Phase 1: 2-3x sustained performance of Blue Waters
• ~2021
  • DOE accelerated exascale systems to deliver "approximately 50x more performance than today's 20-petaflops machines on mission critical applications"
  • "at least one exascale system will be delivered in 2021 to a DOE Office of Science Leadership Computing Facility (Argonne and/or Oak Ridge LCFs)"
• ~2023-2024
  • NSF Leadership Class Computing Facility Phase 2: 10-20x sustained performance of Phase 1
  • DOE: an exascale system at a National Nuclear Security Administration (NNSA) facility (LLNL or LANL)
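Chaining the stated multipliers gives a sense of the planned NSF trajectory relative to Blue Waters. A small sketch; treating the quoted ranges as simple multiplicative factors is my assumption, not a claim from the slides:

```python
# Compound the stated NSF scaling factors relative to Blue Waters.
# Treating the quoted ranges as simple multiplicative endpoints is an
# illustrative assumption, not a statement from the slides.

phase1_vs_bluewaters = (2, 3)    # "2-3x sustained performance of Blue Waters"
phase2_vs_phase1 = (10, 20)      # "10-20x sustained performance of Phase 1"

low = phase1_vs_bluewaters[0] * phase2_vs_phase1[0]
high = phase1_vs_bluewaters[1] * phase2_vs_phase1[1]

print(f"Phase 2 vs Blue Waters: {low}x-{high}x sustained")
# → Phase 2 vs Blue Waters: 20x-60x sustained
```

So Phase 2 would land somewhere between 20x and 60x the sustained performance of Blue Waters, depending on where each phase falls within its quoted range.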
QUESTIONS