creating an exascale ecosystem for science
TRANSCRIPT
ORNL is managed by UT-Battelle
for the US Department of Energy
Creating an Exascale
Ecosystem for Science
Presented to:
HPC Saudi 2017
Jeffrey A. Nichols
Associate Laboratory Director
Computing and Computational Sciences
March 14, 2017
2
Our vision
Sustain leadership and scientific impact in computing and computational sciences
• Provide world’s most powerful open resources for scalable computing and simulation, data and analytics at any scale, and scalable infrastructure for science
• Follow a well-defined path for maintaining world leadership in these critical areas
• Attract the brightest talent and partnerships from all over the world
• Deliver leading-edge science relevant to missions of DOE and key federal and state agencies
• Invest in cross-cutting partnerships with industry
• Provide unique opportunity for innovation based on multiagency collaboration
• Invest in education and training
3
Oak Ridge Leadership Computing Facility (OLCF) is one of the world’s most powerful computing facilities

Titan
• Peak performance: 27 PF/s
• Memory: 710 TB
• Disk bandwidth: 240 GB/s
• Square feet: 5,000
• Power: 8.8 MW

Gaea
• Peak performance: 1.1 PF/s
• Memory: 240 TB
• Disk bandwidth: 104 GB/s
• Square feet: 1,600
• Power: 2.2 MW

Beacon
• Peak performance: 210 TF/s
• Memory: 12 TB
• Disk bandwidth: 56 GB/s

Darter
• Peak performance: 240.9 TF/s
• Memory: 22.6 TB
• Disk bandwidth: 30 GB/s

Data storage
• Spider file system: 40 PB capacity, >1 TB/s bandwidth
• HPSS archive: 240 PB capacity, 6 tape libraries

Data analytics/visualization
• LENS cluster
• Ewok cluster
• EVEREST visualization facility
• uRiKA data appliance

Networks
• ESnet, 100 Gbps
• Internet2, 100 Gbps
• Private dark fibre
4
Our Compute and Data Environment for Science (CADES) provides a shared infrastructure to help solve big science problems
CADES resources:
• Infrastructure (file systems, networking, etc.)
• Condos, clusters, and hybrid clouds
• XK7
• Future technologies, beyond Moore’s Law
• Shared-memory Ultraviolet (UV)
• Graph analytics Cray GX

Facilities and programs served:
• Spallation Neutron Source
• Center for Nanophase Materials Sciences
• Atmospheric Radiation Measurement
• Basic Energy Sciences
• Leadership Computing Facility
• ALICE, etc.
• UT-CADES
5
CADES connects the modes of discovery

In silico investigation:
1. Model: capturing the physical processes
2. Formulation: mapping to solvers and computing
3. Execution: forward process in simulation or iteration for convergence

Empirical/experimental investigation:
A. Experiment design: synthesis or control
B. Alignment: data capture into staged structures
C. Analytics at scale: machine learning
6
CADES Deployment
• CADES Open: ~6,000 cores of integrated condos on InfiniBand; ~5,000 cores of hybrid, expandable cloud; SGI UV and Urika-GD/XA/GX; 5 PB+ high-speed storage; ~3,000 cores of XK7; and several other smaller projects
• CADES Moderate: ~5,000 cores of integrated condos on InfiniBand; ~10,000 OIC cores; attested PHI enclave; integrated with UCAMS and XCAMS; and several ORNL projects on OIC
• Supporting elements: OIC, Cray condos, hybrid cloud, unique heterogeneous platforms, large-scale storage, object store, high-speed interconnects
7
Big Compute + Analytics (OLCF and CADES) coupled to Big Science Data

• Scientific instrument tier (IFIR/CNMS resources): Scanning Transmission Electron Microscopy (STEM), Scanning Tunneling Microscopy (STM), Scanning Probe Microscopy (SPM)
• Beam user tier: connects over HTTPS
• BEAM web and data tier: storage, MySQL database, data/artifacts
• Supercomputing tier (CADES): CADES cluster computing
• Data moves between tiers over local links and high-speed secure data transfer
8
Distributed cloud-based architecture
• DOE HPC cloud: Titan, Edison, and Hopper
• CADES compute clusters
• CADES data storage
• CADES VM web/data server
• Scanning probe microscope
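To make the data path in this architecture concrete, here is a minimal, hypothetical sketch (C with libcurl) of how an instrument-side workstation might push an acquired data file over HTTPS to a web/data server such as the CADES VM above. The URL, file name, and endpoint are illustrative assumptions, not actual BEAM or CADES interfaces.

    /* Hypothetical sketch: upload one instrument data file over HTTPS.
     * The server URL and file name are placeholders for illustration. */
    #include <curl/curl.h>
    #include <stdio.h>

    int main(void)
    {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        FILE *fp = fopen("scan_0001.h5", "rb");   /* example instrument output */
        if (!fp) { curl_easy_cleanup(curl); return 1; }

        curl_easy_setopt(curl, CURLOPT_URL,
                         "https://beam.example.org/upload/scan_0001.h5");
        curl_easy_setopt(curl, CURLOPT_UPLOAD, 1L);   /* HTTPS PUT of the file */
        curl_easy_setopt(curl, CURLOPT_READDATA, fp); /* libcurl reads from fp */

        CURLcode rc = curl_easy_perform(curl);
        if (rc != CURLE_OK)
            fprintf(stderr, "upload failed: %s\n", curl_easy_strerror(rc));

        fclose(fp);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }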
9
ORNL’s computing ecosystem must integrate data analysis and simulation capabilities
• Simulation and data are critical to DOE
• Both need more computing capability
• Both have similar hardware technology requirements
– High bandwidth to memory
– Efficient processing
– Very fast I/O
• Different machine balance may be required

Big data: analyzing and managing large, complex data sets from experiments, observation, or simulation and sharing them with a community
Simulation: used to implement theory; helps with understanding and prediction
Diagram: experiment, theory, and computing as linked modes of discovery
10
2017 OLCF Leadership System
• Vendor: IBM (prime) / NVIDIA™ / Mellanox Technologies®
• At least 5x Titan’s application performance
• Total system memory >6 PB: HBM, DDR, and non-volatile
• Dual-rail Mellanox® InfiniBand full, non-blocking fat-tree interconnect
• IBM Elastic Storage (GPFS™): 2.5 TB/s I/O and 250 PB disk capacity

Approximately 4,600 nodes, each with a hybrid CPU/GPU architecture:
• Multiple IBM POWER9 CPUs and multiple NVIDIA Tesla® GPUs using the NVIDIA Volta architecture
• CPUs and GPUs connected with high-speed NVLink
• Large coherent memory: over 512 GB (HBM + DDR4), all directly addressable from the CPUs and GPUs
• An additional 800 GB of NVRAM, which can be configured as either a burst buffer or as extended memory
• Over 40 TF peak performance

“The Smartest Supercomputer on the Planet”
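As a rough illustration of what “directly addressable from the CPUs and GPUs” means in practice, the minimal sketch below (C with OpenACC, assuming a compiler configured for unified/managed memory) lets the CPU initialize an array, the GPU update it, and the CPU read the result with no explicit host-device copies in the source. It is a sketch of the programming style, not Summit-specific code.

    /* Minimal sketch: one array touched by both CPU and GPU.  On a node
     * with CPU/GPU-coherent (or compiler-managed) memory, the same
     * allocation is usable from both sides, so no explicit data
     * transfers appear in the code.  Assumes an OpenACC compiler with
     * unified/managed memory enabled; otherwise the pragma is ignored
     * and the loop simply runs on the CPU. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long n = 1L << 20;
        double *x = malloc(n * sizeof(double));

        for (long i = 0; i < n; i++)          /* initialized by the CPU */
            x[i] = (double)i;

        #pragma acc parallel loop             /* updated on the GPU     */
        for (long i = 0; i < n; i++)
            x[i] = 2.0 * x[i];

        printf("x[n-1] = %f\n", x[n - 1]);    /* read back by the CPU   */
        free(x);
        return 0;
    }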
11
Summit will replace Titan as the OLCF’s
leadership supercomputer
• Many fewer nodes
• Much more powerful nodes
• Much more memory per node and total system memory
• Faster interconnect
• Much higher bandwidth between CPUs and GPUs
• Much larger and faster file system
Feature | Titan | Summit
Application performance | Baseline | 5-10x Titan
Number of nodes | 18,688 | ~4,600
Node performance | 1.4 TF | >40 TF
Memory per node | 38 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + HBM
NV memory per node | 0 | 800 GB
Total system memory | 710 TB | >6 PB DDR4 + HBM + non-volatile
System interconnect (node injection bandwidth) | Gemini (6.4 GB/s) | Dual-rail EDR-IB (23 GB/s) or dual-rail HDR-IB (48 GB/s)
Interconnect topology | 3D torus | Non-blocking fat tree
Processors | 1 AMD Opteron™, 1 NVIDIA Kepler™ | 2 IBM POWER9™, 6 NVIDIA Volta™
File system | 32 PB, 1 TB/s, Lustre® | 250 PB, 2.5 TB/s, GPFS™
Peak power consumption | 9 MW | 13 MW
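As a back-of-the-envelope check on the injection-bandwidth entries (our arithmetic, not from the slide): each InfiniBand rail carries roughly its link rate divided by 8 bits per byte, and the quoted figures are consistent with two rails minus protocol overhead:

\[
2 \times \frac{100\ \mathrm{Gb/s}}{8} = 25\ \mathrm{GB/s} \approx 23\ \mathrm{GB/s}\ \text{(dual-rail EDR)},\qquad
2 \times \frac{200\ \mathrm{Gb/s}}{8} = 50\ \mathrm{GB/s} \approx 48\ \mathrm{GB/s}\ \text{(dual-rail HDR)}
\]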
12
ECP aims to transform the HPC ecosystem
and make major contributions to the nation
• Develop applications that will tackle a broad spectrum of mission critical problems of unprecedented complexity with unprecedented performance
• Contribute to the economic competitiveness of the nation
• Support national security
• Develop a software stack, in collaboration with vendors, that is exascale-capable and is usable on smaller systems by industry and academia
• Train a large cadre of computational scientists, engineers, and computer scientists who will be an asset to the nation long after the end of ECP
• Partner with vendors to develop computer architectures that support exascale applications
• Revitalize the US HPC vendor industry
• Demonstrate the value of comprehensive co-design
13
The ECP Plan of Record
• A 7-year project, running through 2023 (including 12 months of schedule contingency), that follows the holistic/co-design approach
• Enable an initial exascale system based on advanced architecture and delivered in 2021
• Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023 as part of NNSA and SC facility upgrades
• Acquisition of the exascale systems is outside the ECP scope and will be carried out by DOE-SC and NNSA-ASC supercomputing facilities
14
Transition to higher trajectory with advanced architecture

Chart: computing capability versus time (2017-2027), showing the first exascale advanced architecture system and subsequent capable exascale systems on an elevated trajectory, with 5X and 10X capability annotations.
15
Reaching the elevated trajectory will require advanced and innovative architectures

To reach the elevated trajectory, advanced architectures must be developed that make a big leap in:
– Parallelism
– Memory and storage
– Reliability
– Energy consumption

The exascale advanced architecture will also need to address emerging data science and machine learning problems alongside traditional modeling and simulation applications.

The exascale advanced architecture developments benefit all future U.S. systems on the higher trajectory.
16
ECP follows a holistic approach that uses co-design and integration to achieve capable exascale

Four integrated focus areas:
• Application Development: science and mission applications
• Software Technology: scalable and productive software stack
• Hardware Technology: hardware technology elements
• Exascale Systems: integrated exascale supercomputers

Applications co-design cuts across the software stack: programming models, development environments, and runtimes; tools; math libraries and frameworks; system software (resource management, threading, scheduling, monitoring, and control); memory and burst buffer; data management, I/O, and file systems; node OS and runtimes; correctness, data analysis, and visualization; resilience; workflows; and the hardware interface.

ECP’s work encompasses applications, system software, hardware technologies and architectures, and workforce development.
17
Planned outcomes of the ECP
• Important applications running at exascale in 2021, producing useful results
• A full suite of mission and science applications ready to run on the 2023 capable exascale systems
• A large cadre of computational scientists, engineers, and computer scientists with deep expertise in exascale computing, who will be an asset to the nation long after the end of ECP
• An integrated software stack that supports exascale applications
• Results of PathForward R&D contracts with vendors that are integrated into exascale systems and are in vendors’ product roadmaps
• Industry and mission-critical applications prepared for a more diverse and sophisticated set of computing technologies, carrying U.S. supercomputing well into the future
18
The Oak Ridge Leadership Computing Facility is on a well-defined path to exascale

Titan and beyond: hierarchical parallelism with very powerful nodes
• 2010 – Jaguar: 2.3 PF, multi-core CPU, 7 MW (scaled to 300,000 cores)
• 2012 – Titan: 27 PF, hybrid GPU/CPU, 9 MW
• 2017 – Summit: 200 PF, hybrid GPU/CPU, 13 MW
• 2021-2022 – OLCF-5: 5-10x Summit, ~20-50 MW

Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism: MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors (see the sketch below).
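A minimal sketch of that hierarchical programming model, assuming only standard MPI and OpenMP (the toy workload and names are illustrative): MPI ranks split the problem across nodes, OpenMP threads split each rank’s share across cores, and the inner loop is left to the compiler to vectorize.

    /* Hypothetical sketch of hierarchical parallelism: MPI across nodes,
     * OpenMP threads (plus compiler vectorization) within a node.       */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank sums its share of 1..N; threads split the rank's share. */
        long lo = (long)rank * N / size + 1;
        long hi = (long)(rank + 1) * N / size;
        double local = 0.0;

        #pragma omp parallel for reduction(+:local)  /* thread + vector level */
        for (long i = lo; i <= hi; i++)
            local += (double)i;

        double total = 0.0;                          /* node (MPI) level */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %.0f (expected %.0f)\n", total, 0.5 * N * (N + 1.0));

        MPI_Finalize();
        return 0;
    }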
19
Summary
• ORNL has a long history in high-performance computing for science, delivering many first-of-a-kind systems that were among the world’s most powerful computers. We will continue this as a core competency of the laboratory
• Delivering an ecosystem focused on the integration of computing and data into instruments of science and engineering
• This ecosystem delivers important, time-critical science with enormous impacts
20
Questions?