TRANSCRIPT
Office of Science
The present and future at OLCF
Bronson Messer, Acting Group Leader, Scientific Computing
November 28, 2012
Architectural Trends – No more free lunch
• CPU clock rates quit increasing in 2003
• P = CV²f: power consumed is proportional to the frequency and to the square of the voltage
• Voltage can't go any lower, so frequency can't go higher without increasing power
• Power is capped by heat dissipation and $$$
• Performance increases have been coming through increased parallelism (a worked example follows below)
Herb Sutter, Dr. Dobb's Journal: http://www.gotw.ca/publications/concurrency-ddj.htm
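To make the trade-off concrete, here is a back-of-the-envelope comparison using the dynamic-power relation; the specific 10% voltage reduction is an illustrative assumption, not a figure from the slides:

% Dynamic power: P = C V^2 f  (C = switched capacitance)
% Doubling the clock of one core at fixed voltage:
\[
P_{2f} = C V^2 (2f) = 2\,C V^2 f
\quad\Rightarrow\quad 2\times \text{power for } \sim 2\times \text{peak throughput.}
\]
% Two cores at the original clock, with an assumed 10% voltage reduction:
\[
P_{\text{2 cores}} = 2\,C\,(0.9V)^2 f = 1.62\,C V^2 f
\quad\Rightarrow\quad 1.62\times \text{power for } 2\times \text{peak throughput.}
\]

This asymmetry is why added parallelism, not added frequency, became the route to higher performance.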
[Full-slide image; reference: astro-ph/9912202]
ORNL's "Titan" Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

SYSTEM SPECIFICATIONS:
• Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
• 18,688 compute nodes, each with:
  • 16-core AMD Opteron CPU
  • NVIDIA Tesla "K20x" GPU
  • 32 + 6 GB memory
• 512 service and I/O nodes
• 200 cabinets
• 710 TB total system memory
• Cray Gemini 3D torus interconnect
• 8.8 MW peak power
• 4,352 ft² (404 m²) footprint
Cray XK7 Compute Node

XK7 compute node characteristics:
• CPU: AMD Opteron 6274 16-core processor
• GPU: Tesla K20x @ 1311 GF
• Host memory: 32 GB 1600 MHz DDR3
• Tesla K20x memory: 6 GB GDDR5
• Gemini high-speed interconnect (3D torus links in X, Y, and Z)
[Node diagram: the CPU connects to the GPU over PCIe Gen2 and to the Gemini interconnect over HT3]
Slide courtesy of Cray, Inc.
Titan: Cray XK7 System
• Compute node: 1.45 TF, 38 GB
• Board: 4 compute nodes, 5.8 TF, 152 GB
• Cabinet: 24 boards (96 nodes), 139 TF, 3.6 TB
• System: 200 cabinets, 18,688 nodes, 27 PF, 710 TB
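The hierarchy multiplies out consistently; a quick check of the slide's own numbers:

\[
18{,}688 \text{ nodes} \times 1.45\ \mathrm{TF} \approx 27.1\ \mathrm{PF},
\qquad
18{,}688 \text{ nodes} \times 38\ \mathrm{GB} \approx 710\ \mathrm{TB}.
\]
% 200 cabinets x 96 node slots = 19,200 slots; the 18,688 compute nodes
% plus the 512 service and I/O nodes fill them exactly.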
Titan is an upgrade of Jaguar
Phase 1: Replaced all of the Cray XT5 node boards with XK7 node boards, replaced fans, and added power supplies and a 3.3 MW transformer.
Reused parts from Jaguar:
• Cabinets
• Backplanes
• Interconnect cables
• Power supplies
• Liquid cooling system
• RAS system
• File system
The upgrade saved $25M over the cost of a new system.
Titan's Power & Cooling: Designed for Efficiency
• Flywheel-based UPS for highest efficiency
• Variable-speed chillers save energy
• Liquid cooling is 1,000 times more efficient than air cooling
• 13,800 volt power into the building saves on transmission losses
• 480 volt power to the computers saves $1M in installation costs and reduces losses
• Vapor barriers and positive air pressure keep humidity out of the computer center
Result: With a PUE (power usage effectiveness: total facility power divided by IT power) of 1.25, ORNL has one of the world's most efficient data centers. A worked example follows below.
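To see what a PUE of 1.25 implies, assume for illustration that Titan's 8.8 MW peak draw is the IT load to which the PUE applies (the slide quotes the two figures separately):

\[
\mathrm{PUE} = \frac{P_{\text{total facility}}}{P_{\text{IT}}}
\;\Rightarrow\;
P_{\text{total}} \approx 1.25 \times 8.8\ \mathrm{MW} = 11\ \mathrm{MW},
\]
% i.e. roughly 2.2 MW of overhead for cooling and power distribution
% at peak, under this assumed pairing of the slide's figures.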
Why GPUs? High Performance and Power Efficiency on a Path to Exascale
• Hierarchical parallelism – improves scalability of applications
• Exposing more parallelism through code refactoring and source code directives
• Heterogeneous multi-core processor architecture – use the right type of processor for each task
• Data locality – keep the data near the processing. The GPU has high bandwidth to its local memory for rapid access, and a large internal cache.
• Explicit data management – explicitly manage data movement between CPU and GPU memories (see the sketch after this list)
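As a minimal sketch of what explicit data management looks like in practice, here is a CUDA C example; the kernel, array names, and sizes are invented for illustration and are not from the slides:

/* Explicit CPU<->GPU data movement with the CUDA runtime API. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical kernel: scales a vector on the GPU. */
__global__ void scale(double *x, double a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(double);

    double *h_x = (double *)malloc(bytes);   /* host (CPU) memory   */
    for (int i = 0; i < n; i++)
        h_x[i] = 1.0;

    double *d_x;
    cudaMalloc((void **)&d_x, bytes);        /* device (GPU) memory */

    /* The programmer explicitly stages data into GPU memory,
       computes there, and copies the result back.             */
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0, n);
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    printf("h_x[0] = %f\n", h_x[0]);         /* expect 2.0 */

    cudaFree(d_x);
    free(h_x);
    return 0;
}

Keeping data resident on the GPU across many kernel calls, rather than copying around every call, is what "keep the data near the processing" means in practice.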
How Effective are GPUs on Scalable Applications?
OLCF-3 Early Science codes; very early performance measurements on Titan

XK7 (w/ K20x) vs. XE6:
• Cray XK7: K20x GPU plus AMD 6274 CPU
• Cray XE6: dual AMD 6274, no GPU
• Cray XK6 w/o GPU: single AMD 6274, no GPU

Application  | Performance ratio      | Comments
S3D          | 1.8                    | Turbulent combustion; 6% of Jaguar workload
Denovo sweep | 3.8                    | Sweep kernel of 3D neutron transport for nuclear reactors; 2% of Jaguar workload
LAMMPS       | 7.4* (mixed precision) | High-performance molecular dynamics; 1% of Jaguar workload
WL-LSMS      | 1.6                    | Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
CAM-SE       | 1.5                    | Community atmosphere model; 1% of Jaguar workload
Additional Applications from Community Efforts
Current performance measurements on Titan or the CSCS system

XK7 (w/ K20x) vs. XE6:
• Cray XK7: K20x GPU plus AMD 6274 CPU
• Cray XE6: dual AMD 6274, no GPU
• Cray XK6 w/o GPU: single AMD 6274, no GPU

Application | Performance ratio | Comments
NAMD        | 1.4               | High-performance molecular dynamics; 2% of Jaguar workload
Chroma      | 6.1               | High-energy nuclear physics; 2% of Jaguar workload
QMCPACK     | 3.0               | Electronic structure of materials; new to OLCF, common to …
SPECFEM-3D  | 2.5               | Seismology; 2008 Gordon Bell finalist
GTC         | 1.6               | Plasma physics for fusion energy; 2% of Jaguar workload
CP2K        | 1.5               | Chemical physics; 1% of Jaguar workload
Hierarchical Parallelism
• MPI parallelism between nodes (or PGAS)
• On-node, SMP-like parallelism via threads (or subcommunicators, or …)
• Vector parallelism
  • SSE/AVX/etc. on CPUs
  • GPU threaded parallelism
• Exposing unrealized parallelism is essential to exploit all near-future architectures (a sketch of the three levels follows below)
• Uncovering unrealized parallelism and improving data locality improve the performance of even CPU-only code
• Experience with vanguard codes at OLCF suggests 1-2 person-years are required to "port" extant codes to GPU platforms
  • Likely less if begun today, due to better tools/compilers
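A minimal sketch of the three levels in C, assuming MPI and OpenMP are available; the array names, sizes, and the assumption that the rank count divides N evenly are all illustrative:

/* Three levels of hierarchical parallelism in one loop nest.
   Minimal sketch; compile with e.g. mpicc -fopenmp.          */
#include <mpi.h>

#define N 1048576

static double a[N], b[N];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);            /* Level 1: MPI between nodes */

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank works on a contiguous slice of the global arrays
       (assumes nranks divides N evenly, for brevity).            */
    int chunk = N / nranks;
    int lo = rank * chunk, hi = lo + chunk;

    for (int i = lo; i < hi; i++)
        b[i] = (double)i;

    #pragma omp parallel for           /* Level 2: threads on a node */
    for (int i = lo; i < hi; i++) {
        /* Level 3: vector parallelism; a simple loop body like this
           is auto-vectorized with SSE/AVX by modern compilers.      */
        a[i] = 2.0 * b[i] + a[i];
    }

    MPI_Finalize();
    return 0;
}

On a GPU node, the thread and vector levels map onto the GPU's thread hierarchy instead.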
How do you program these nodes?
• Compilers
  – OpenACC is a set of compiler directives that lets the user express hierarchical parallelism in the source code, so that the compiler can generate parallel code for the target platform, be it GPU, MIC, or vector SIMD on a CPU (see the sketch after this list)
  – The Cray compiler supports XK7 nodes and is OpenACC compatible
  – The CAPS HMPP compiler supports C, C++, and Fortran compilation for heterogeneous nodes, with OpenACC support
  – The PGI compiler supports OpenACC and CUDA Fortran
• Tools
  – The Allinea DDT debugger scales to full system size and, with ORNL support, will be able to debug heterogeneous (x86/GPU) apps
  – ORNL has worked with the Vampir team at TU Dresden to add support for profiling codes on heterogeneous nodes
  – CrayPAT and Cray Apprentice2 support XK6 programming
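Here is a minimal OpenACC sketch in C, the directive-based counterpart of the explicit CUDA example earlier; the loop and variable names are invented, and compiler invocations vary by vendor:

/* Directive-based GPU offload with OpenACC.
   Minimal sketch; compile with an OpenACC-capable
   compiler, e.g. Cray or PGI.                     */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    /* The directives describe the data motion and the parallelism;
       the compiler generates code for the target (GPU, MIC, or
       vector SIMD on a CPU).                                       */
    #pragma acc data copyin(x) copy(y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}

The same source still compiles and runs as ordinary serial C when the directives are ignored, which is a large part of OpenACC's appeal for porting extant codes.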
A unified x86/accelerator development environment enhances productivity with a common look and feel across both the x86-64 and accelerator sides:
• Cray Compiling Environment
• Cray Scientific & Math Libraries
• Cray Performance Monitoring and Analysis Tools
• Cray Message Passing Toolkit
• Cray Debug Support Tools
• CUDA SDK
• GNU

Slide courtesy of Cray, Inc.
Filesystems
• The Spider center-wide file system is the operational work file system for most NCCS systems. It is a large-scale Lustre file system with over 26,000 clients, providing 10.7 PB of disk space, and it has a demonstrated bandwidth of 240 GB/s.
• A new Spider procurement is underway:
  • roughly double the capacity
  • roughly 4x the bandwidth
Three primary ways to get access to the LCFs
Distribution of allocable hours (4.7 billion core-hours in CY2013):
• 60% INCITE
• 30% ASCR Leadership Computing Challenge
• 10% Director's Discretionary
These allocations provide leadership-class, DOE/SC capability computing.
LCF support models for INCITE
• The "two-pronged" support model is shared across the LCFs
• Specific organizational implementations differ slightly, but the user perspective is virtually identical
[Figure: the centers' parallel team structures, pairing scientific computing liaisons with catalysts, and user assistance and outreach with user service and outreach, alongside visualization, performance engineering, end-to-end workflows, and data analytics and visualization]
Basics
• The User Assistance group provides "front-line" support for day-to-day computing issues
• SciComp liaisons provide advanced algorithmic and implementation assistance
• Assistance in data analytics and workflow management, visualization, and performance engineering is also provided for each project (both tasks are "housed" in SciComp at OLCF)
Most SciComp interactions fall into one of three bins:
• "User support +"
  – The SciComp liaison answers basic questions when asked and serves as an internal advocate for the project at the OLCF
  – Constant "pings" from liaisons
• "Rainmakers"
  – The SciComp liaison "parachutes in" and undertakes a short, intense burst of development activity to surmount a singular application problem
  – The usual duration is less than 2 months of wallclock time and about 1 FTE-month of effort
• "Collaborators"
  – The SciComp liaison is a member of (in several cases, one of the leaders of) the code development team
  – The liaison is a co-author on scientific papers
Questions? [email protected]
The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.