

DARPA’s Ubiquitous High-Performance Computing (UHPC) / Exascale Projects

Presented by Raj Parihar

Advanced Computer Architecture Lab

University of Rochester, Rochester, NY



References

Runnemede: An Architecture for Ubiquitous High-Performance Computing. Intel Labs, UIUC, Reservoir Labs (HPCA’13)

The MIT Angstrom Project. MIT, UMCP (HotPar’11)

GPUs and the Future of Parallel Computing. NVIDIA (IEEE Micro’11)

Sandia’s X-Caliber Project. Sandia Lab, Micron, LexisNexis, and 8 academic partners


DARPA’s Exascale Challenge

Build an exascale machine by 2020 using today’s technology

Current best (Top500.org, June 2013): Tianhe-2
Speed: 33.86 petaflops on the Linpack benchmark
Power: 17.6 MW (24 MW with cooling)

DARPA’s challenge and design goals:

10^18 operations per second in a 20 MW power budget
Achieve an energy efficiency of 50 GigaOps/Watt (see the derivation below)
Energy efficiency 100-1000x better than current systems
No constraints of backward compatibility
Assume the technology/packaging of 2018-2020 (10 nm node)

Design the whole stack to be energy efficient, from software and programming model to low-power circuits and transistors
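
The 20 MW budget and the 10^18 ops/s target together fix the efficiency figure; a quick check of the arithmetic behind 50 GigaOps/Watt:

    10^18 ops/s ÷ (20 × 10^6 W) = 5 × 10^10 ops/s per watt = 50 GigaOps/Watt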


Runnemede: High Performance System (HPCA’13)

Hierarchical, heterogeneous, near-threshold computing

Overprovisioned, support for selective execution and power down

No hardware cache coherence; coherence is managed in software

Dataflow-style execution: tasks are known as codelets (a small sketch follows)
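
As a rough illustration of the codelet idea (illustrative only; the class and function names below are made up and are not the Runnemede runtime API), a codelet becomes runnable only once all of its input dependences are satisfied, and then runs to completion on a core:

    # Minimal dataflow/codelet scheduling sketch (names are assumptions, not Runnemede's API).
    from collections import deque

    class Codelet:
        def __init__(self, name, n_inputs, body):
            self.name = name
            self.pending = n_inputs   # unsatisfied input dependences
            self.inputs = []
            self.body = body          # runs non-preemptively once enabled

    def satisfy(codelet, value, ready_queue):
        """Deliver one input; enqueue the codelet when all inputs have arrived."""
        codelet.inputs.append(value)
        codelet.pending -= 1
        if codelet.pending == 0:
            ready_queue.append(codelet)

    def run_all(ready_queue):
        while ready_queue:
            c = ready_queue.popleft()   # a core would pick this up and run it to completion
            c.body(c.inputs)

    ready = deque()
    sink = Codelet("sum", 2, lambda ins: print("sum =", ins[0] + ins[1]))
    satisfy(sink, 3, ready)
    satisfy(sink, 4, ready)
    run_all(ready)                      # prints: sum = 7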


Runnemede: Block Architecture

Control Engine (CE): executes OS/runtime code, performs I/O

Execution Engine (XE): simple in-order cores that execute codelets


Runnemede: Chip Architecture

The chip consists of 576 cores in total, organized in a three-level hierarchy

Implements a physical address space, with no virtual memory

Fine-grain DVFS, power and clock gating to save energy
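
For a sense of scale, one three-level grouping that yields 576 cores is 8 XEs plus 1 CE per block, 8 blocks per unit, and 8 units per chip; this split is stated here only as an assumption consistent with the slide's total, not quoted from the paper:

    # Assumed decomposition (illustrative only; see the HPCA'13 paper for the actual grouping)
    cores_per_block = 8 + 1   # 8 XEs + 1 CE
    blocks_per_unit = 8
    units_per_chip = 8
    print(cores_per_block * blocks_per_unit * units_per_chip)   # 576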


Runnemede: Network Topology

Contains two independent hierarchical networks:

A data network and a barrier/reduction network

Hierarchical network allows Runnemede to provide tapered BW

Efficient short-distance communication

Also leverages the insight that relatively high-radix switches reduce the overall network energy
Three options considered: fat-tree, hybrid-tree, pruned-tree (a tapering sketch follows)
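
A toy calculation of what tapered bandwidth means for a hierarchical tree (all numbers below are illustrative assumptions, not Runnemede's actual link widths): with a taper factor below 1, aggregate bandwidth shrinks at each level toward the root, trading bisection bandwidth for network energy:

    # Illustrative only: per-level aggregate bandwidth in a 3-level tree.
    # taper = 1.0 corresponds to a full fat-tree; taper < 1.0 thins the tree toward the root.
    leaf_bw = 64.0                # assumed aggregate bandwidth at the leaf level (arbitrary units)
    for taper in (1.0, 0.5):
        per_level = [leaf_bw * taper ** level for level in range(3)]
        print(f"taper={taper}: per-level bandwidth {per_level}")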


HW-SW Co-design: Optimization for SAR

Benchmark: a streaming sensor application based on SAR (synthetic aperture radar)
Input: a set of vectors; output: an image of the reflected energy of points

ISAopt: adds a sin-cos instruction to the ISA

TrigOpt: single-precision computation for every pixel is replaced by double-precision computation for a subset of pixels plus interpolation for the remaining pixels

Blocking: each codelet copies its input array into the L1 scratchpad rather than fetching values from DRAM (see the sketch below)

CompilerOpt: skips some address calculations and applies strength reduction
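
A generic sketch of the blocking idea (names below are hypothetical; this is not the actual SAR codelet code): each codelet stages a scratchpad-sized slice of its input into a local buffer with one bulk copy, then works entirely out of that buffer instead of touching DRAM per element:

    # Illustrative blocking/tiling sketch (hypothetical names).
    BLOCK = 1024   # elements assumed to fit comfortably in the L1 scratchpad

    def process_codelet(dram_input, start, count):
        out = []
        for base in range(start, start + count, BLOCK):
            scratch = dram_input[base:base + BLOCK]   # one bulk copy into the "scratchpad"
            for x in scratch:                         # all further accesses are local
                out.append(x * x)                     # stand-in for the per-pixel work
        return out

    print(sum(process_codelet(list(range(4096)), 0, 4096)))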


Effect of Technology Scaling

Computation energy scales well: 77% reduction (45 nm to 10 nm)

Network energy only decreases by 51%

Memory energy also decreases drastically, primarily due to the use of stacked DRAM


Network Analysis

A hybrid-tree with tapered bandwidth is a better choice than a fat-tree (energy inefficient) or a pruned-tree (low bisection bandwidth)


Evaluation of Scratchpad Memories

Matrix multiplication: memory energy breakdown


MIT Angstrom Project

Led by Anant Agarwal; the team includes MIT (MTL, RLE, MPhC labs), Freescale Semiconductor, Mercury Systems, Lockheed ATL, and the University of Maryland

Major challenges and research topics under exploration:

Ultra-low-voltage SRAM design
A hierarchical cache-coherence protocol with distributed discretionary directories and data
The Zettabricks system: a language, compiler, and runtime system for automatic parallel code generation
Self-Aware Factored Operating System (SEFOS): a self-aware OS targeted at 1000+ core systems
Helper threads: exascale computers will have thousands of cores; unused cores can be used for prefetching and early branch resolution
The SEEC framework and decision engine


Helper Thread in Exascale Machine

Some applications may lack the parallelism to keep all the cores busy
Some applications may also incur parallelization overheads (communication and synchronization) that outweigh the benefits of exploiting large-scale parallelism

One solution: use a few cores for load and branch “pre-execution”

Key challenges and topics to explore (a helper-thread sketch follows this list):
In a 1000-core machine, helper threads are physically distributed; how does this affect the generation of effective helper-thread code?
What is the right proportion of helper threads to compute threads to achieve the best performance and power efficiency?
How should the operating system schedule helper versus compute threads to maximize benefit while minimizing resource contention?
Can helper threads run on extremely low-power cores to achieve very high power efficiency yet still provide effective memory and branch latency tolerance?
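
A minimal sketch of the pre-execution idea (illustrative; the variable names and synchronization scheme are assumptions, not the Angstrom mechanism): a helper thread runs a stripped-down copy of the main loop a fixed distance ahead, touching the data so it is warm in the cache when the compute thread arrives:

    # Illustrative helper-thread prefetching sketch (assumed scheme, hypothetical names).
    import threading

    DATA = list(range(1_000_000))
    DISTANCE = 64                  # how far ahead the helper is allowed to run

    def helper(progress):
        for i in range(len(DATA)):
            while i - progress[0] > DISTANCE:   # throttle: stay only DISTANCE ahead
                pass
            _ = DATA[i]            # "prefetch": touch the element so it is cache-resident

    def compute(progress):
        total = 0
        for i, x in enumerate(DATA):
            total += x             # the real work
            progress[0] = i        # publish progress so the helper can throttle itself
        return total

    progress = [0]
    threading.Thread(target=helper, args=(progress,), daemon=True).start()
    print(compute(progress))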


Low Power Partner Cores in Multicore (HotPar’11)

The main core generates events and places them in an event queue
The partner core serves these events based on their priorities (see the sketch below)
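
A small sketch of the main-core/partner-core split (illustrative; the event kinds and priority values below are made up): the main core enqueues prioritized events, and the partner core repeatedly pops and services the highest-priority one:

    # Illustrative main-core / partner-core event queue (assumed event kinds and priorities).
    import heapq

    events = []                                   # min-heap: lower number = higher priority

    def main_core_emit(priority, kind, payload):
        heapq.heappush(events, (priority, kind, payload))

    def partner_core_serve():
        while events:
            priority, kind, payload = heapq.heappop(events)
            print(f"serving {kind} (priority {priority}): {hex(payload)}")

    main_core_emit(2, "prefetch", 0x1000)         # e.g. an address to prefetch
    main_core_emit(1, "branch-resolve", 0x2000)   # e.g. a branch PC to pre-resolve
    partner_core_serve()                          # branch-resolve is served before prefetch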


Case Study: Memory Prefetching (EM3D)

Each core issues 1 instruction per cycle; the main core runs at 1 GHz

Speedup: up to 2.7x; power efficiency (perf/watt): 2.2x


Echelon: A research GPU architecture

NVIDIA-led group: Stephen W. Keckler, William J. Dally, Brucek Khailany, Michael Garland, David Glasco

GPUs and the Future of Parallel Computing (IEEE Micro’11)

The state-of-the-art GPU-based high-throughput computing system

How to scale a GPU-based architecture to meet exascale demands

At 10 nm in 2017, GPUs will no longer be an external accelerator to a CPU; instead, CPUs and GPUs will be integrated on the same die with a unified memory architecture
The Throughput-Optimized Core architecture’s goals:

Extreme energy efficiency by eliminating as many instruction overheads as possible
Memory locality at multiple levels
Efficient execution for instruction-level parallelism (ILP), data-level parallelism (DLP), and fine-grained task-level parallelism (TLP)


Sandia’s UHPC X-Caliber Project

Sandia-led team with Micron, LexisNexis, and 8 academic partners

Simple pipeline of some sort: Wide access(?), Multithreaded

Scratchpad vs cache: Shared w/ registers? globally addressable?

Instruction encoding: Compressed? Contains dataflow state?

Composition of stack (optics? memory? logic?)

Thermal migration: move computation around to keep the chip within thermal bounds

Codelet/static dataflow model
Aggressive architecture focusing on the data movement problem

Vast design space
Iterative, application-driven co-design process
