B5: Exascale Hardware

Capability Requirements

• Several different requirements
  – Exaflops/Exascale single application
  – Ensembles of Petaflop apps requiring Exaflops-years
  – Streaming/Realtime
  – I/O intensive (e.g., analysis, data mining)

• Capacity computing is not considered here

Exaflops are Possible

• Extrapolation of the Top500 suggests 1 EF in 2019 (see the sketch below)

• DOE (through ASCI and the LCF) has contributed to staying on this trajectory
  – May require investment to stay on this trajectory
  – History shows Federal investment accelerated top systems
  – May not get usable FLOPS (non-LINPACK) without investment

[Figure not preserved in transcript]
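To make the extrapolation concrete, here is a minimal sketch; the ~1 PF baseline (Roadrunner, 2008) and the ~2x/year growth rate of the #1 system are assumptions chosen to match the historical Top500 trend, not figures from the slide.

# Hedged sketch: project the Top500 #1 trend forward.
# Baseline and growth rate are assumed approximations of the
# historical trend, not data from this presentation.
base_year = 2008
perf_pflops = 1.0          # ~1 PF at #1 in 2008 (Roadrunner)
growth_per_year = 2.0      # top systems have roughly doubled yearly

year = base_year
while perf_pflops < 1000.0:   # 1000 PF = 1 EF
    year += 1
    perf_pflops *= growth_per_year

print(f"~1 EF around {year}")  # -> ~1 EF around 2018

Under these assumptions the trend crosses 1 EF around 2018, consistent with the slide's 2019 estimate.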

Components of an Exascale System

• It's not just FLOPS. Need:
  – Processors
  – Interconnect
  – Memory
  – I/O (persistent storage)
  – Connection to the outside world
  – Balance of these

• Constraints include:
  – Power
  – Cooling
  – Reliability
  – Adoption by applications, particularly legacy codes, including a familiar development environment
  – Cost :)

Example Commodity Design

[Figure: example commodity design (not preserved in transcript)]

Notes on Commodity Design

• Based on Jeff Vetter's extrapolation of current technology
  – Details in the ORNL presentation

• Does not preserve the performance ratios (e.g., bytes/flop of interconnect bandwidth) commonly expected; see the balance check below
  – This is not new; e.g., PC memory/disk size ratios have changed significantly

• Most (all?) Exascale system designs will mandate some changes in those ratios
  – R&D can either reduce the change in the ratio or reduce the impact of the change (e.g., new algorithms)
  – E.g., more specialized systems may provide better cost/performance for specific application classes
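As a worked example of the ratio concern, here is a minimal balance check; the node peak, injection bandwidth, and expected ratio below are illustrative assumptions, not figures from Vetter's extrapolation.

# Hedged sketch: interconnect balance for a hypothetical node.
# All parameter values are assumed for illustration only.
node_peak_gflops = 10_000        # 10 TF/s peak per node (assumed)
injection_gb_per_s = 100         # 100 GB/s network injection (assumed)
expected_bytes_per_flop = 0.1    # a commonly expected balance (assumed)

actual = injection_gb_per_s / node_peak_gflops
print(f"{actual:.3f} bytes/flop vs. ~{expected_bytes_per_flop} expected")
# -> 0.010 bytes/flop: an order of magnitude below expectation

A design like this either needs algorithms that tolerate the lower ratio or R&D that narrows the gap, as the bullets above note.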

Issues (concerns)

• There are possible hazards:
  – Interconnect performance
    • Latency, bandwidth
  – I/O
    • Density, bandwidth, fault management
  – Memory
    • Cost, power (and latency and bandwidth)
  – Power
    • 4M PS3s give 1 EF but would use 1 GW (arithmetic below)
  – Latency/bandwidth/faults/concurrency
  – Software and algorithms
    • Working around/with latency, bandwidth, faults, and concurrency
• Non-issue: getting the peak FLOPS
• All of these can (must) benefit from research and development investment
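The power hazard can be checked with the slide's own back-of-envelope numbers; the per-console flops and wattage here are rough public figures, assumed for illustration.

# Hedged sketch: the "4M PS3s = 1 EF but 1 GW" arithmetic.
# Per-console figures are rough approximations, assumed here.
consoles = 4_000_000
sp_gflops_each = 250    # Cell peak single precision, roughly
watts_each = 250        # approximate wall power per console

total_ef = consoles * sp_gflops_each / 1e9    # GFLOPS -> EFLOPS
total_gw = consoles * watts_each / 1e9        # W -> GW
print(f"{total_ef:.1f} EF (SP) at {total_gw:.1f} GW")
# -> 1.0 EF at 1.0 GW: the flops are reachable, the power is not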

Alternate Directions

• Commodity
  – GPGPU and STI Cell offer very high compute density wrt commodity CPUs (rough numbers below)
  – Ex.: 4M PS3s = 1 EF (single precision)
  – But:
    • Not all algorithms can effectively use these systems
    • Programming complexity is (currently) much greater
  – Embedded processors (better FLOPS/Watt)
• New architectures
  – PIM, FPGA-centric, …
• Not in this time frame
  – Quantum, molecular, DNA, …
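To put "compute density" in rough numbers, the sketch below compares peak single-precision throughput per part; every figure is an approximate circa-2008 public number, assumed for illustration only.

# Hedged sketch: rough single-precision compute density, circa 2008.
# All figures are approximate and assumed; treat as order-of-magnitude.
parts = {
    # name: (peak SP GFLOPS, approx. watts)
    "commodity quad-core CPU": (80, 100),
    "STI Cell (as in PS3)":    (200, 100),
    "GPGPU (G80-class)":       (350, 150),
    "embedded core array":     (50, 10),    # lower peak, better GF/W
}
for name, (gflops, watts) in parts.items():
    print(f"{name:>24}: {gflops:4d} GF, {gflops / watts:5.1f} GF/W")

The ordering, not the exact values, is the point: accelerators win on density, embedded parts win on FLOPS/Watt, and commodity CPUs trail on both, at the programmability cost noted above.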

Promising Tech

• Technologies that can improve system balance (ratios), cost, reliability, etc.

• Optimizing the use of CPU die space (manycore, multicore, stream, vector, heterogeneous, variable-precision arithmetic, etc.)

• Optical networks (faster signaling, cheaper/denser connectors)
• Optics into/out of the processor
• 3-D chips, integrated memory/processor
• Faster development of customized processors
• Hardware-accelerated system verification (e.g., RAMP)
• NAND Flash, MRAM, and other non-volatile memory (disk replacements)
• Myriad approaches to power efficiency

Cross Cutting Issues

• Better characterization of algorithm requirements wrt system ratios
• New algorithms to match system ratios
  – Disk I/O vs. main memory
  – Interconnect bandwidth vs. flops
  – Etc.
• New algorithms/software to detect and handle faults (a minimal sketch follows this list)
• New approaches to algorithms/software for specialized/disruptive processor architectures
  – E.g., good ways to move apps to GPGPUs, PIMs, or FPGAs
• Need to accelerate applications and algorithms (esp. new ones) to PF now to prepare for EF
• Programming languages and environments
  – PGAS, domain-specific languages, auto-tuners, hierarchical programming models (built on current models)
  – Interaction with hardware (e.g., user-managed caches, remote atomic updates, etc.)
  – Performance modeling and debugging
  – Productivity, etc.
  – System software, OS (e.g., memory management)
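As one concrete instance of "algorithms to detect and handle faults", here is a minimal sketch of checksum-based detection in the spirit of algorithm-based fault tolerance; the function and tolerance are illustrative, not a scheme from the slides.

# Hedged sketch: detect silent corruption in a linear operation
# by carrying a checksum through it. Illustrative only.
def checked_scale(vec, factor):
    """Scale a vector; verify the result via a transformed checksum."""
    check_in = sum(vec)                   # checksum before the operation
    out = [factor * x for x in vec]
    check_out = sum(out)                  # checksum after the operation
    # For a linear operation, the checksum must scale the same way.
    if abs(check_out - factor * check_in) > 1e-9 * max(abs(check_out), 1.0):
        raise RuntimeError("fault detected: checksum mismatch")
    return out

print(checked_scale([1.0, 2.0, 3.0], 2.0))   # -> [2.0, 4.0, 6.0]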

Sample Plan Components

• Point studies for the future
  – Like the Petaflops point designs, with more application/algorithm designer involvement, and including the OS. Evaluate the time/cost to get apps running on the system. An ongoing process; contrast with a baseline.
• Early simulation and modeling of systems, algorithms, and applications (see open source below), including hardware (e.g., RAMP), particularly wrt promising technologies; a minimal modeling sketch follows this list
• Evaluate special-purpose architectures and non-MPI programming models for application/algorithm classes (cheaper, faster, better)
• Partnerships for disruptive technologies
  – Need to understand timelines and costs
  – Goal is to accelerate; not required for Exaflops
• Directed vendor partnerships
  – QCDOC is a good example
• Support application involvement from the beginning
  – Wrt point designs, with performance understanding
  – Must encourage new apps to increase community size
• Some principles
  – Open source
  – Support multiple prototypes (at suitable scale)
  – Establish a framework to move from point studies to full systems through multiple stages
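As a taste of the "early simulation and modeling" item, here is a minimal roofline-style analytic model; the machine parameters and kernel characteristics are hypothetical, chosen only to show the method.

# Hedged sketch: a minimal roofline-style model for early design studies.
# Machine and kernel parameters are assumed, for illustration only.
def kernel_time(flops, bytes_moved, peak_flops, mem_bw):
    """Lower-bound time: limited by compute or by memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

peak, bw = 10e12, 400e9     # hypothetical node: 10 TF/s, 400 GB/s

# A stream-like kernel: 2 flops and 24 bytes per element, 1e9 elements.
t = kernel_time(flops=2e9, bytes_moved=24e9, peak_flops=peak, mem_bw=bw)
bound = "memory" if 24e9 / bw > 2e9 / peak else "compute"
print(f"predicted time: {t * 1e3:.1f} ms ({bound} bound)")
# -> 60.0 ms, memory bound: balance, not peak flops, dominates here

Even this crude model makes the deck's point quantitatively: for many kernels, system ratios rather than peak FLOPS determine delivered performance.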