
Office of Science

The present and future at OLCF

Bronson Messer, Acting Group Leader, Scientific Computing

November 28, 2012

2

Architectural Trends – No more free lunch

• CPU clock rates quit increasing in 2003
• P = C·V²·f: the dynamic power consumed is proportional to the frequency and to the square of the voltage (illustrated in the sketch below)
• Voltage can't go any lower, so frequency can't go higher without increasing power
• Power is capped by heat dissipation and $$$
• Performance increases have been coming through increased parallelism

Herb Sutter, Dr. Dobb's Journal:
http://www.gotw.ca/publications/concurrency-ddj.htm
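The power relation above is why the free lunch ended. Here is a minimal C sketch of P = C·V²·f with made-up capacitance, voltage, and frequency values (not figures for any real chip): pushing the clock higher, which in practice also means raising the voltage, grows power much faster than it grows performance.

```c
/* Minimal sketch of the dynamic power relation P = C * V^2 * f.
 * All numbers are illustrative placeholders, not figures for any real chip. */
#include <stdio.h>

static double dynamic_power(double C, double V, double f) {
    return C * V * V * f;   /* switched capacitance * voltage^2 * frequency */
}

int main(void) {
    const double C = 1.0e-9;           /* effective switched capacitance (arbitrary) */
    const double V = 1.0, f = 2.0e9;   /* baseline: 1.0 V at 2 GHz */

    double base   = dynamic_power(C, V, f);
    /* Doubling the clock in practice also requires more voltage, so power
     * grows much faster than the 2x gain in clock rate. */
    double pushed = dynamic_power(C, 1.2 * V, 2.0 * f);

    printf("baseline:          %.2f W\n", base);
    printf("2x clock, +20%% V:  %.2f W (%.1fx the power)\n", pushed, pushed / base);
    return 0;
}
```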

3

[Figure slide: arXiv:astro-ph/9912202]

4

ORNL's "Titan" Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

SYSTEM SPECIFICATIONS:
• Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
• 18,688 compute nodes, each with:
  • 16-core AMD Opteron CPU
  • NVIDIA Tesla "K20x" GPU
  • 32 + 6 GB memory
• 512 service and I/O nodes
• 200 cabinets
• 710 TB total system memory
• Cray Gemini 3D torus interconnect
• 8.8 MW peak power
• 4,352 ft² (404 m²)

5

Cray XK7 Compute Node

[Node diagram: the Opteron host connects to the Gemini interconnect (X, Y, Z torus links) via HT3 and to the Tesla K20x via PCIe Gen2]

XK7 Compute Node Characteristics:
• AMD Opteron 6274 16-core processor
• Tesla K20x @ 1311 GF
• Host memory: 32 GB, 1600 MHz DDR3
• Tesla K20x memory: 6 GB GDDR5
• Gemini high-speed interconnect

Slide courtesy of Cray, Inc.

6

Titan: Cray XK7 System

• Compute node: 1.45 TF, 38 GB
• Board: 4 compute nodes, 5.8 TF, 152 GB
• Cabinet: 24 boards (96 nodes), 139 TF, 3.6 TB
• System: 200 cabinets (18,688 nodes), 27 PF, 710 TB

These totals roll up from the per-node figures, as sketched below.
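As a sanity check on how these figures compose, here is a minimal C sketch that rolls the per-node numbers (1.45 TF, 38 GB) up to the board, cabinet, and system totals; it uses only values quoted on these slides and decimal unit conversions.

```c
/* Roll-up of the per-node figures quoted on these slides (1.45 TF, 38 GB)
 * to the board, cabinet, and system totals. Decimal units (1 TB = 1000 GB). */
#include <stdio.h>

int main(void) {
    const double node_tf = 1.45;   /* peak TF per XK7 node (CPU + GPU) */
    const double node_gb = 38.0;   /* 32 GB host + 6 GB GPU memory */

    const int nodes_per_board = 4;
    const int boards_per_cab  = 24;
    const int nodes_per_cab   = nodes_per_board * boards_per_cab;   /* 96 */
    const int compute_nodes   = 18688;                              /* system-wide */

    printf("Board:   %5.1f TF, %6.0f GB\n",
           nodes_per_board * node_tf, nodes_per_board * node_gb);
    printf("Cabinet: %5.1f TF, %6.1f TB\n",
           nodes_per_cab * node_tf, nodes_per_cab * node_gb / 1000.0);
    printf("System:  %5.1f PF, %6.0f TB\n",
           compute_nodes * node_tf / 1000.0, compute_nodes * node_gb / 1000.0);
    return 0;
}
```

Note that 200 cabinets × 96 node slots is 19,200, which matches the 18,688 compute nodes plus the 512 service and I/O nodes listed earlier, so the system line above uses the compute-node count directly.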

7

Titan is an upgrade of Jaguar

Phase 1: Replaced all of the Cray XT5 node boards with XK7 node boards, replaced fans, added power supplies and a 3.3 MW transformer.

Reused parts from Jaguar:
• Cabinets
• Backplanes
• Interconnect cables
• Power supplies
• Liquid cooling system
• RAS system
• File system

Upgrade saved $25M over the cost of a new system!

8

Titan's Power & Cooling: Designed for Efficiency

• Flywheel-based UPS for highest efficiency
• Variable-speed chillers save energy
• Liquid cooling is 1,000 times more efficient than air cooling
• 13,800-volt power into the building saves on transmission losses
• 480-volt power to the computers saves $1M in installation costs and reduces losses
• Vapor barriers and positive air pressure keep humidity out of the computer center

Result: With a PUE of 1.25, ORNL has one of the world's most efficient data centers.

9

Why GPUs? High Performance and Power Efficiency on a Path to Exascale

• Hierarchical parallelism – improves scalability of applications
• Exposing more parallelism through code refactoring and source code directives
• Heterogeneous multi-core processor architecture – use the right type of processor for each task
• Data locality – keep the data near the processing; the GPU has high bandwidth to its local memory for rapid access and a large internal cache
• Explicit data management – explicitly manage data movement between CPU and GPU memories (a sketch follows below)
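To make the data-locality and explicit-data-management points concrete, here is a minimal, hypothetical OpenACC sketch in C (not code from any OLCF application): a structured data region keeps the arrays resident in GPU memory across two kernels, so they cross the PCIe bus only once in each direction.

```c
/* Hypothetical sketch of explicit CPU<->GPU data management with OpenACC.
 * The data region keeps x and y resident in GPU memory across both loops,
 * so the arrays cross the PCIe bus only once in each direction. */
void scale_and_add(double *x, double *y, double a, int n)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            x[i] *= a;            /* first kernel: scales the device copy of x */

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] += x[i];         /* second kernel: reuses x without a re-copy */
    }   /* y is copied back to host memory here; the device copy of x is discarded */
}
```

Without the enclosing data region, each parallel loop would have to move the arrays between host and GPU on its own, which is exactly the traffic that explicit data management is meant to avoid.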

10

How Effective are GPUs on Scalable Applications?
OLCF-3 Early Science Codes – very early performance measurements on Titan

XK7 (w/ K20x) vs. XE6
Cray XK7: K20x GPU plus AMD 6274 CPU
Cray XE6: Dual AMD 6274, no GPU
Cray XK6 w/o GPU: Single AMD 6274, no GPU

Application   | Performance Ratio      | Comments
S3D           | 1.8                    | Turbulent combustion; 6% of Jaguar workload
Denovo sweep  | 3.8                    | Sweep kernel of 3D neutron transport for nuclear reactors; 2% of Jaguar workload
LAMMPS        | 7.4* (mixed precision) | High-performance molecular dynamics; 1% of Jaguar workload
WL-LSMS       | 1.6                    | Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
CAM-SE        | 1.5                    | Community atmosphere model; 1% of Jaguar workload

11

Additional Applications from Community Efforts
Current performance measurements on Titan or the CSCS system

XK7 (w/ K20x) vs. XE6
Cray XK7: K20x GPU plus AMD 6274 CPU
Cray XE6: Dual AMD 6274, no GPU
Cray XK6 w/o GPU: Single AMD 6274, no GPU

Application  | Performance Ratio | Comment
NAMD         | 1.4               | High-performance molecular dynamics; 2% of Jaguar workload
Chroma       | 6.1               | High-energy nuclear physics; 2% of Jaguar workload
QMCPACK      | 3.0               | Electronic structure of materials; new to OLCF, common to …
SPECFEM-3D   | 2.5               | Seismology; 2008 Gordon Bell finalist
GTC          | 1.6               | Plasma physics for fusion energy; 2% of Jaguar workload
CP2K         | 1.5               | Chemical physics; 1% of Jaguar workload

12

Hierarchical Parallelism

• MPI parallelism between nodes (or PGAS) – see the sketch after this list
• On-node, SMP-like parallelism via threads (or subcommunicators, or …)
• Vector parallelism
  • SSE/AVX/etc. on CPUs
  • GPU threaded parallelism
• Exposure of unrealized parallelism is essential to exploit all near-future architectures.
• Uncovering unrealized parallelism and improving data locality improves the performance of even CPU-only code.
• Experience with vanguard codes at OLCF suggests 1-2 person-years are required to "port" extant codes to GPU platforms.
  • Likely less if begun today, due to better tools/compilers
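A deliberately simplified skeleton of these levels, built around a hypothetical dot-product kernel rather than any OLCF code: MPI ranks across nodes, OpenMP threads within a node, and a vectorizable inner loop.

```c
/* Hypothetical skeleton of hierarchical parallelism:
 * MPI between nodes, OpenMP threads on a node, SIMD in the inner loop. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double a[N], b[N];
    double local = 0.0, global = 0.0;

    /* Level 2: SMP-like thread parallelism on the node. */
    #pragma omp parallel for simd reduction(+:local)
    for (int i = 0; i < N; ++i) {      /* Level 3: vectorizable loop body */
        a[i] = i * 0.5;
        b[i] = i * 2.0;
        local += a[i] * b[i];
    }

    /* Level 1: MPI parallelism between nodes. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product over %d ranks: %e\n", nranks, global);

    MPI_Finalize();
    return 0;
}
```

In a real application the on-node level might be subcommunicators or GPU kernels instead of OpenMP, as the list above notes; the overall structure of node-level parallelism nested inside MPI stays the same.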


13

How do you program these nodes?

• Compilers
  – OpenACC is a set of compiler directives that lets the user express hierarchical parallelism in the source code so that the compiler can generate parallel code for the target platform, be it GPU, MIC, or vector SIMD on a CPU (a small example follows below)
  – The Cray compiler supports XK7 nodes and is OpenACC compatible
  – The CAPS HMPP compiler supports C, C++, and Fortran compilation for heterogeneous nodes with OpenACC support
  – The PGI compiler supports OpenACC and CUDA Fortran
• Tools
  – The Allinea DDT debugger scales to full system size and, with ORNL support, will be able to debug heterogeneous (x86/GPU) apps
  – ORNL has worked with the Vampir team at TUD to add support for profiling codes on heterogeneous nodes
  – CrayPAT and Cray Apprentice support XK6 programming
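As a small, hypothetical illustration of the kind of directive OpenACC provides (the stencil routine below is invented for this example, not taken from the slides), gang and vector clauses expose two levels of parallelism that an OpenACC-capable compiler such as those listed above can map to a GPU kernel or to multicore/SIMD code on a CPU.

```c
/* Hypothetical OpenACC loop nest: gang/vector clauses expose hierarchical
 * parallelism that the compiler maps to the target (GPU, MIC, or CPU SIMD). */
void smooth(const float *restrict in, float *restrict out, int nx, int ny)
{
    #pragma acc parallel loop gang copyin(in[0:nx*ny]) copy(out[0:nx*ny])
    for (int j = 1; j < ny - 1; ++j) {        /* outer loop spread across gangs */
        #pragma acc loop vector
        for (int i = 1; i < nx - 1; ++i) {    /* inner loop spread across vector lanes */
            out[j*nx + i] = 0.25f * (in[j*nx + i - 1] + in[j*nx + i + 1] +
                                     in[(j-1)*nx + i] + in[(j+1)*nx + i]);
        }
    }
}
```

Because the parallelism is expressed with directives rather than a GPU-specific language, the same source can in principle be retargeted by any of the compilers on this slide.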

14

Unified x86/Accelerator Development Environment enhances productivity with a common look and feel

[Diagram: the Cray programming environment presents a common toolchain across the x86-64 host and the accelerator]
• Cray Compiling Environment
• Cray Scientific & Math Libraries
• Cray Performance Monitoring and Analysis Tools
• Cray Message Passing Toolkit
• Cray Debug Support Tools
• CUDA SDK
• GNU

Slide courtesy of Cray, Inc.

15

Filesystems

• The Spider center-wide file system is the operational work file system for most NCCS systems. It is a large-scale Lustre file system, with over 26,000 clients, providing 10.7 PB of disk space. It also has a demonstrated bandwidth of 240 GB/s.

• A new Spider procurement is underway:
  • roughly double the capacity
  • roughly 4x the bandwidth

16

Three primary ways for access to the LCFs

Distribution of allocable hours (4.7 billion core-hours in CY2013):
• 60% INCITE
• 30% ASCR Leadership Computing Challenge
• 10% Director's Discretionary

[Chart labels: Leadership-class computing; DOE/SC capability computing]

17

LCF support models for INCITE

• The "two-pronged" support model is shared
• Specific organizational implementations differ slightly, but the user perspective is virtually identical

[Diagram of the two centers' support organizations; labels include: Scientific computing, Liaisons, Visualization, Performance, End-to-end workflows, Data analytics and visualization, Catalysts, Performance engineering, User assistance and outreach, User service and outreach]

18

Basics

• User Assistance group provides “front-line” support for day-to-day computing issues

• SciComp Liaisons provide advanced algorithmic and implementation assistance

• Assistance in data analytics and workflow management, visualization, and performance engineering is also provided for each project (both tasks are "housed" in SciComp at OLCF)

19

Most SciComp interactions fall into one of three bins

• "user support +"
  – The SciComp liaison answers basic questions when asked and serves as an internal advocate for the project at the OLCF
  – Constant "pings" from liaisons
• "rainmakers"
  – The SciComp liaison "parachutes in" and undertakes a short, intense burst of development activity to surmount a singular application problem
  – The usual duration is less than 2 months in wallclock time and 1 FTE-month in effort
• collaborators
  – The SciComp liaison is a member of (in several cases, one of the leaders of) the code development team
  – The liaison is a co-author on scientific papers

20

Questions? bronson@ornl.gov


The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
