Slide 1: Exascale Computing—a Fact or a Fiction?
Shekhar Borkar, Intel Corp.
IPDPS 2013, May 21, 2013
This research was, in part, funded by the U.S. Government (DOE and DARPA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Slide 2: Outline
- Compute roadmap & technology outlook
- Challenges & solutions for: compute, memory, interconnect, resiliency, and the software stack
- Summary
Slide 3: Compute Performance Roadmap
[Chart: GFLOPs vs. year, 1960-2020, log scale. Supercomputer performance climbs from Mega through Giga, Tera, and Peta toward Exa, with roughly 12, 11, and 10 years between successive 1000X milestones; client and hand-held curves track below.]
Slide 4: From Giga to Exa, via Tera & Peta
[Three charts, 1986-2016: relative processor frequency (30X from Giga to Tera, 250X to Peta); processor performance and concurrency (36X to Tera, 4,000X to Peta, 2.5M X to Exa); power (80X to Tera, 4,000X to Peta, 1M X to Exa).]
System performance increases faster than single-processor performance.
Parallelism continues to increase.
The power and energy challenge continues.
Slide 5: Where is the Energy Consumed?
Teraflop system today, ~3 KW total:
- Compute: 100 pJ per FLOP, 100 W
- Memory: 0.1 B/FLOP @ 1.5 nJ per byte, 150 W
- Communication: 100 pJ of communication per FLOP, 100 W
- Disk: 10 TB @ 1 TB/disk @ 10 W, 100 W
- The remaining ~2.5 KW: decode and control, address translations, power supply losses, and architecture bloated with inefficient features
Goal: ~20 W for the same teraflop (5 W, 2 W, ~5 W, ~3 W, and 5 W across compute, memory, communication, disk, and overhead).
Slide 6: The UHPC* Challenge
20 pJ per operation across the whole range: 20 µW at Mega, 20 mW at Giga, 2 W at 100 Giga, 20 W at Tera, 20 KW at Peta, and 20 MW at Exa.
*DARPA Ubiquitous HPC Program
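The slide's milestones are straight arithmetic: power is energy per operation times operations per second. A minimal sketch of that calculation:

```python
def power_watts(pj_per_op: float, ops_per_sec: float) -> float:
    """Total power = energy per operation x operations per second."""
    return pj_per_op * 1e-12 * ops_per_sec

# At the UHPC target of 20 pJ/operation:
giga = power_watts(20, 1e9)    # ~0.02 W (20 mW)
tera = power_watts(20, 1e12)   # ~20 W
peta = power_watts(20, 1e15)   # ~20 KW
exa  = power_watts(20, 1e18)   # ~20 MW
```

The same formula shows why 100 pJ/FLOP (slide 5) puts an exascale machine at 100 MW before memory and interconnect are even counted.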
Slide 7: Technology Scaling Outlook
[Four charts across the 45nm-5nm nodes: relative transistor density continues at 1.75-2X per generation; relative frequency is almost flat; relative supply voltage is almost flat; relative energy shows some scaling but falls short of the ideal curve.]
Slide 8: Energy per Compute Operation
[Chart, 45nm-7nm: energy (pJ) for a DP floating-point op, DP register-file operand access, DRAM access (pJ/bit), and communication (pJ/bit). Callouts mark 100, 75, 25, and 10 pJ/bit points. Source: Intel]
Slide 9: Voltage Scaling
[Chart: normalized frequency, total power, leakage, and energy efficiency vs. supply voltage (0.3-0.9 of nominal Vdd). Energy efficiency rises as Vdd drops, when the design is built to voltage scale.]
Slide 10: Near Threshold-Voltage (NTV)
[Two charts, 65nm CMOS at 50°C (H. Kaul et al., 16.6, ISSCC 2008): maximum frequency (MHz) and total power (mW) vs. supply voltage (0.2-1.4 V), with the subthreshold region below 320 mV. Frequency drops about 4 orders of magnitude while total power drops less than 3 orders. Energy efficiency (GOPS/Watt) peaks near threshold, a 9.6X improvement over nominal; active leakage power (mW) is also shown.]
Slide 11: Subthreshold Leakage at NTV
[Chart, 45nm-5nm: source-drain leakage as a fraction of total power (0-60%) at 100%, 75%, 50%, and 40% of Vdd; the fraction grows at lower voltage, and variations increase.]
NTV operation reduces total power and improves energy efficiency, but subthreshold leakage power becomes a substantial portion of the total.
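The NTV sweet spot can be illustrated with a toy model (all constants here are illustrative assumptions, not the measured 65nm data above): frequency follows a simplified alpha-power law above threshold, dynamic power scales as CV²f, and a crude linear leakage term is added.

```python
VT = 0.32       # threshold voltage, matching the 320 mV point above (V)
VDD_NOM = 1.2   # nominal supply (V)

def rel_frequency(v: float, alpha: float = 1.5) -> float:
    """Frequency relative to nominal, ~ (V - Vt)^alpha / V."""
    return ((v - VT) ** alpha / v) / ((VDD_NOM - VT) ** alpha / VDD_NOM)

def rel_power(v: float, leak_frac: float = 0.1) -> float:
    """Power relative to nominal: dynamic CV^2f plus a crude leakage term."""
    dyn = (v / VDD_NOM) ** 2 * rel_frequency(v)
    leak = leak_frac * (v / VDD_NOM)  # illustrative linear leakage model
    return dyn + leak

def efficiency(v: float) -> float:
    """Throughput per watt, relative to nominal Vdd."""
    return rel_frequency(v) / rel_power(v)

# Sweep Vdd from 0.4 V to 1.2 V: the efficiency peak lands near threshold
best_v = max((0.4 + 0.05 * i for i in range(17)), key=efficiency)
```

Even this toy model reproduces the qualitative picture: efficiency several times better near threshold than at full Vdd, limited below that by leakage that no longer amortizes over useful work.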
Slide 12: Mitigating Impact of Variation
1. Variation control with body biasing:
- The body effect is substantially reduced in advanced technologies
- The energy cost of body biasing could become substantial
- Fully-depleted transistors have no body left to bias
2. Variation tolerance at the system level:
[Diagram: a many-core system with cores running at their native frequencies f, f/2, and f/4 rather than one worst-case frequency.]
- Running all cores at full frequency exceeds the energy budget
- Run each core at its native frequency; the law of large numbers averages out the variation
Slide 13: Experimental NTV Processor
[Die photo: 1.1 mm x 1.8 mm IA-32 core with logic, scan, ROM, L1 I$ and D$, and level shifters plus clock spine; mounted on a custom interposer in a 951-pin FCBGA package on a legacy Socket-7 motherboard.]
Technology: 32nm High-K Metal Gate
Interconnect: 1 poly, 9 metal (Cu)
Transistors: 6 million (core)
Core area: 2 mm2
S. Jain et al., "A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS", ISSCC 2012
Slide 14: Wide Dynamic Range
[Chart: energy efficiency vs. voltage, from subthreshold through the normal operating range; ~5X improvement demonstrated at NTV.]
- Ultra-low power (subthreshold): 280 mV, 3 MHz, 2 mW, 1500 Mips/W
- Energy efficient (NTV): 0.45 V, 60 MHz, 10 mW, 5830 Mips/W
- High performance: 1.2 V, 915 MHz, 737 mW, 1240 Mips/W
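The three operating points are internally consistent: Mips/W is throughput over power, and assuming roughly one instruction per cycle the reported frequencies and powers reproduce the reported efficiencies.

```python
def mips_per_watt(freq_mhz: float, power_mw: float, ipc: float = 1.0) -> float:
    """Efficiency = instruction throughput (Mips) / power (W).
    ipc ~ 1.0 is an assumption for this simple in-order core."""
    return freq_mhz * ipc / (power_mw * 1e-3)

subthreshold = mips_per_watt(3, 2)      # ~1500  (reported: 1500)
ntv          = mips_per_watt(60, 10)    # ~6000  (reported: 5830)
full_vdd     = mips_per_watt(915, 737)  # ~1241  (reported: 1240)
```

The NTV point is roughly 4.7X more efficient than full Vdd, which is the "~5X demonstrated" callout on the chart.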
Slide 15: Observations
[Chart: power breakdown into memory leakage, memory dynamic, logic leakage, and logic dynamic at subthreshold, NTV, and full Vdd.]
Leakage power dominates at low voltage; fine-grain leakage power management is required.
Slide 16: Integration of Power Delivery
Move power delivery closer to the load, for efficiency and management.
[Photos: integrated voltage regulator testchip in standard OLGA packaging; converter chip and load chip on top, air-core inductors and input/output capacitors on the bottom, 5 mm RF launch.]
[Chart: conversion efficiency (70-90%) vs. load current (0-20 A) for 2.4V-to-1.5V and 2.4V-to-1.2V conversion at 60-100 MHz, with L = 1.9 nH and 0.8 nH.]
Schrom et al., "A 100MHz 8-Phase Buck Converter Delivering 12A in 25mm2 Using Air-Core Inductors", APEC 2007
Power delivery close to the load enables (1) improved efficiency and (2) fine-grain power management.
Slide 17: Fine-grain Power Management
Dynamic, chip level:
- STANDBY: memory retains data, 50% less power per tile
- FULL SLEEP: memories fully off, 80% less power per tile
Dynamic, within a core (21 sleep regions per tile):
- FP engine 1 sleeping: 90% less power
- FP engine 2 sleeping: 90% less power
- Router sleeping: 10% less power (stays on to pass traffic)
- Data memory sleeping: 57% less power
- Instruction memory sleeping: 56% less power
Mode / state / power saving / wake-up:
- Normal: all active / - / -
- Standby: logic off, memory on / 50% / fast
- Sleep: logic and memory off / 80% / slow
Energy efficiency increases by 60%.
Vangal et al., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS", JSSC, January 2008
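The chip-level accounting behind such tables is simple bookkeeping over per-tile power states. A sketch, assuming a hypothetical 2 W nominal tile (the paper reports whole-chip figures, not this per-tile number); the savings fractions are the ones on the slide.

```python
TILE_NOMINAL_W = 2.0  # illustrative assumption, not a reported figure

SAVINGS = {"normal": 0.0, "standby": 0.50, "sleep": 0.80}

def tile_power(mode: str) -> float:
    """Per-tile power after applying the mode's saving fraction."""
    return TILE_NOMINAL_W * (1.0 - SAVINGS[mode])

def chip_power(mode_counts: dict) -> float:
    """Total chip power for tiles spread across power states."""
    return sum(tile_power(mode) * count for mode, count in mode_counts.items())

# 80 tiles: 40 active, 30 in standby, 10 fully asleep
mixed  = chip_power({"normal": 40, "standby": 30, "sleep": 10})  # 114 W
all_on = chip_power({"normal": 80})                              # 160 W
```

The trade is visible in the table's wake-up column: deeper states save more power per tile but cost a slower wake-up, which is why the runtime must pick states per tile rather than globally.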
Slide 18: Memory & Storage Technologies
[Chart, log scale 0.01-100,000: energy/bit (picojoules), capacity (Gbit), and cost/bit (pico-dollars) for SRAM, DRAM, NAND/PCM (endurance issue), and disk. Source: Intel]
Slide 19: Compare Memory Technologies
DRAM for first-level capacity memory; NAND/PCM for next-level storage. (Source: Intel)
Slide 20: Revise DRAM Architecture
Traditional DRAM (RAS/CAS):
- Activates many pages
- Lots of reads and writes (refresh)
- Only a small amount of the read data is used
- Requires a small number of pins
New DRAM architecture (direct addressing):
- Activates few pages
- Reads and writes (refreshes) only what is needed
- All read data is used
- Requires a large number of IOs (3D)
[Chart, 90nm-7nm: bandwidth demand (GB/sec) must increase exponentially while energy (pJ/bit) must decrease exponentially.]
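The "all read data is used" point can be made concrete with a rough useful-data model (page size, page count, and request size below are illustrative assumptions, not specifics from the slide):

```python
def activation_efficiency(bytes_used: int, pages_activated: int,
                          page_bytes: int = 1024) -> float:
    """Fraction of the data pulled out of the arrays that is actually used."""
    return bytes_used / (pages_activated * page_bytes)

# Traditional DRAM: a 64-byte cache-line fetch activates a wide multi-page row
traditional = activation_efficiency(bytes_used=64, pages_activated=8)

# Revised architecture: activate only the one page holding the line
revised = activation_efficiency(bytes_used=64, pages_activated=1)
```

Under these assumptions less than 1% of the activated bits are useful in the traditional case, which is why narrowing activation (and paying for it with many 3D IOs) cuts pJ/bit.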
Slide 21: 3D-Integration of DRAM and Logic
[Diagram: a DRAM stack over a logic buffer chip in one package.]
- Logic buffer chip: technology optimized for high-speed signaling, energy-efficient logic circuits, and implementing intelligence
- DRAM stack: technology optimized for memory density and lower cost
3D integration provides the best of both worlds.
Slide 22: 1Tb/s HMC DRAM Prototype
- 3D integration technology
- 1Gb DRAM array, 512 MB total DRAM per cube
- 128 GB/s bandwidth, <10 pJ/bit energy
DDR3 (today): 10.66 GB/sec at 50-75 pJ/bit. Hybrid Memory Cube: 128 GB/sec at 8 pJ/bit.
Roughly 10X higher bandwidth at 10X lower energy. (Source: Micron)
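Worth noting what the table implies for power: access power at full bandwidth is bandwidth times energy per bit, so the HMC delivers its ~12X bandwidth advantage for only a modest power increase.

```python
def memory_power_watts(gb_per_sec: float, pj_per_bit: float) -> float:
    """Interface power at full bandwidth = bytes/s x 8 bits x J/bit."""
    return gb_per_sec * 1e9 * 8 * pj_per_bit * 1e-12

ddr3 = memory_power_watts(10.66, 60)   # ~5.1 W at the mid-range of 50-75 pJ/bit
hmc  = memory_power_watts(128, 8)      # ~8.2 W
bw_ratio = 128 / 10.66                 # ~12X more bandwidth
```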
Slide 23: Communication Energy
[Chart, log-log: energy per bit (0.01-100 pJ/bit) vs. interconnect distance (0.1-1000 cm) for on-die, chip-to-chip, board-to-board, and between-cabinet links.]
Slide 24: On-die Interconnect
[Chart, 90nm-7nm: relative compute energy scales down faster than on-die interconnect energy. Source: Intel]
Interconnect energy (per mm) reduces more slowly than compute energy, so on-die data movement energy will start to dominate.
Slide 25: Network On Chip (NoC)
80-Core TFLOP Chip (2006):
[Die photo: 12.64 mm x 21.72 mm, single tile 1.5 mm x 2.0 mm, with I/O areas, PLL, and TAP.]
- Power breakdown: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%
- 8 x 10 mesh, 32-bit links, 320 GB/sec bisection bandwidth @ 5 GHz
48-Core Single-Chip Cloud (2009):
[Die photo: 26.5 mm x 21.4 mm, with four DDR3 MCs, VRC, PLL, JTAG, tiles, and system interface + I/O.]
- Power breakdown: cores 70%, MC & DDR3-800 19%, routers & 2D mesh 10%, global clocking 1%
- 2-core clusters in a 6 x 4 mesh (why not 6 x 8?), 128-bit links, 256 GB/sec bisection bandwidth @ 2 GHz
Slide 26: On-chip Interconnects
[Chart: delay (100-100,000 ps, log scale) and energy (0-0.9 pJ/bit) vs. wire length (0-25 mm) at 0.3µ pitch and 0.5 V, comparing raw wire delay, repeated wire delay, router delay, wire energy, and router energy.]
Slide 27: Circuit Switched NoC
[Die/diagram: 8x8 NoC with clock I/O, traffic generation and measurement, and ports to core, north, south, east, and west.]
- Process: 45nm Hi-K/MG CMOS, 9 metal (Cu); nominal supply 1.1 V
- Arbitration and router logic supports 512b data; 2.85M transistors; 6.25 mm2 die area
- A narrow, high-frequency packet-switched network establishes a circuit
- Data is then transferred over the wide, slower circuit-switched bus
- A differential, low-swing bus improves energy efficiency
2 to 3X higher energy efficiency than a packet-switched network.
Anders et al., "A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8x8 mesh network-on-chip in 45nm CMOS", ISSCC 2010
Slide 28: Hierarchical & Heterogeneous
[Diagram: clusters of cores on local buses, joined by a second-level bus.]
Buses connect cores over short distances; a hierarchy of buses, or hierarchical circuit- and packet-switched networks, connects clusters.
Slide 29: Electrical Interconnect < 1 Meter
[Chart, process generations 1.2µ-32nm (source: ISSCC papers): link energy (pJ/bit) falls and data rate (Gb/sec) rises with each generation.]
Bandwidth and energy efficiency improve, but not enough.
Slide 30: Electrical Interconnect Advances
Employ new, low-loss, non-traditional interconnects: low-loss flex connectors, low-loss twinax, top-of-the-package connectors.
Co-optimize interconnects and circuits for energy efficiency.
[Chart: energy (pJ/bit) vs. channel length (0-400 cm) for HDI, flex, and twinax against the state of the art; eye diagrams shown.]
O'Mahony et al., "A 47x10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS", ISSCC 2010; and J. Jaussi, RESS004, IDF 2010
Slide 31: Optical Interconnect > 1 Meter
[Two charts: energy (pJ/bit) vs. laser efficiency (1%, 10%, 20%) at 100% link utilization, and energy (pJ/bit) vs. link utilization (100%, 50%, 10%) at each laser efficiency.]
- Energy in the supporting electronics is very low
- Link energy is dominated by the laser (its efficiency)
- Sustained, high link utilization is required
Source: PETE study group
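The shape of both charts follows from one relationship: the laser burns power whether or not data flows, so its energy is amortized only over the bits actually sent. A rough model (the per-bit constants below are illustrative assumptions, not the PETE group's figures):

```python
def optical_pj_per_bit(laser_eff: float, utilization: float,
                       optical_pj: float = 0.1,
                       electronics_pj: float = 0.2) -> float:
    """Effective pJ/bit: wall-plug laser energy amortized over used bits,
    plus the (small) supporting-electronics energy."""
    return optical_pj / (laser_eff * utilization) + electronics_pj

# 10% laser efficiency, fully utilized link:
busy = optical_pj_per_bit(0.10, 1.0)   # ~1.2 pJ/bit
# The same link at 10% utilization:
idle = optical_pj_per_bit(0.10, 0.1)   # ~10.2 pJ/bit
```

Hence the slide's conclusion: improving laser efficiency and keeping links busy matter far more than shaving the electronics.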
Slide 32: Straw-man System Interconnect
[Diagram: ~0.8 PF nodes grouped into clusters of 35 with 1,000 fibers each; 35 clusters with 8,000 fibers each.]
Assume 40 Gbps links, 10 pJ/bit, $0.6/Gbps, 8 B/FLOP, and naive tapering: $35M in optics and 217 MW.
Slide 33: Bandwidth Tapering
[Chart: Byte/FLOP at each level (Core, L1, L2, Chip, Board, Cab, L1 Sys, L2 Sys) for 8 Byte/FLOP total: naive 4X tapering gives 24, 6.65, 1.13, 0.19, 0.03, 0.0045, 0.0005, 0.00005, vs. a severe tapering schedule.]
[Charts: data movement power by level. Naive tapering totals 217 MW; severe tapering totals 3 MW.]
Intelligent bandwidth tapering is necessary.
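The per-level arithmetic that drives these totals: data movement power at a level is FLOP rate times bytes per FLOP times 8 bits times energy per bit. A sketch at the straw-man's 10 pJ/bit photonic links (level-by-level energies differ in practice, so this is a bound, not a reconstruction of the 217 MW chart):

```python
EXAFLOP = 1e18  # FLOP/s

def dm_power_mw(byte_per_flop: float, pj_per_bit: float) -> float:
    """Data movement power (MW) for one hierarchy level at exascale."""
    return EXAFLOP * byte_per_flop * 8 * pj_per_bit * 1e-12 / 1e6

# Sustaining the full 8 Byte/FLOP at system scale over photonic links:
untapered = dm_power_mw(8.0, 10)       # 640 MW: untenable
# The severely tapered system level from the chart:
severe = dm_power_mw(0.00005, 10)      # 0.004 MW
```

Tapering works only if applications keep their traffic local, which is why the software-stack slides later insist on locality-aware scheduling.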
Slide 34: Road to Unreliability?
From Peta to Exa, reliability issues compound:
- 1,000X parallelism: more hardware for something to go wrong; >1,000X intermittent faults due to soft errors
- Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptibility to Vcc droops (noise) and dynamic temperature variations; exacerbated intermittent faults (soft errors)
- Deeply scaled technologies: aging-related faults; lack of burn-in?; variability increases dramatically
Resiliency will be the cornerstone.
Slide 35: Soft Errors and Reliability
[Chart: normalized SER per cell at sea level vs. voltage (0.5-2 V) for 250nm down to 65nm. The soft error rate per bit drops each generation, and NTV has only a nominal impact on it.]
[Chart, 180nm-32nm: memory and latch SER relative to 130nm, assuming a 2X bit/latch count increase per generation. Soft errors at the system level will continue to increase.]
NTV also has positive impacts on reliability: lower voltage means lower electric fields and lower power means lower temperature, so device aging effects will be less of a concern and electromigration-related defects are reduced.
N. Seifert et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006
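The system-level trend is the product of two opposing factors: per-bit SER improves each generation while the bit and latch count doubles. A sketch (the 0.8X per-generation per-bit improvement is an illustrative assumption; the 2X bit growth is the slide's):

```python
def relative_system_ser(generations: int, per_bit_scaling: float = 0.8,
                        bits_scaling: float = 2.0) -> float:
    """System-level SER relative to the starting node after n generations."""
    return (per_bit_scaling * bits_scaling) ** generations

# Five generations (e.g. 130nm down to 32nm):
trend = relative_system_ser(5)   # 1.6^5, roughly 10X worse at system level
```

As long as per-bit improvement is weaker than 2X per generation, the product grows, which is the slide's point.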
Slide 36: Resiliency
Faults, with examples:
- Permanent faults: stuck-at 0 & 1
- Gradual faults: variability, temperature
- Intermittent faults: soft errors, voltage droops
- Aging faults: degradation
Faults cause errors (data & control):
- Datapath errors: detected by parity/ECC
- Silent data corruption: needs HW hooks
- Control errors: control lost (blue screen)
Resiliency spans the whole stack: circuit & design, microarchitecture, microcode & platform, system software, programming system, and applications together provide error detection, fault isolation, fault confinement, reconfiguration, and recovery & adaptation, with minimal overhead.
Slide 37: Architecture Needs a Paradigm Shift
Must revisit and evaluate each (even legacy) architecture feature.
Architects' past and present priorities: single-thread performance (frequency), programming productivity (legacy, compatibility), and architecture features for productivity, constrained by (1) cost and (2) reasonable power/energy.
Architects' future priorities should be: throughput performance (parallelism, application-specific HW) and power/energy (architecture features for energy, simplicity), constrained by (1) programming productivity and (2) cost.
Slide 38: HW-SW Co-design
Applications and the SW stack provide top-down guidance for efficient system design; circuits & design, architecture, programming system, and system SW feed limitations, issues, and opportunities to exploit back up the stack.
Slide 39: Bottom-up Guidance
1. NTV reduces energy but exacerbates variations: small & fast cores, randomly distributed, temperature dependent.
2. Limited NTV for arrays (memory) due to stability issues: memory voltage scales disproportionately to compute, but memory arrays can be made larger.
3. On-die interconnect energy (per mm) does not reduce as much as compute energy: [chart, 45nm-7nm] 6X for compute vs. 1.6X for interconnect.
4. At NTV, leakage power is a substantial portion of total power: [chart, 45nm-5nm: SD leakage fraction at 100%, 75%, 50%, and 40% Vdd, with increasing variations] expect 50% leakage; idle hardware consumes energy.
5. DRAM energy scales, but not enough: 50 pJ/bit today, 8 pJ/bit demonstrated (3D Hybrid Memory Cube), need < 2 pJ/bit.
6. System interconnect is limited by laser energy and cost: [chart, 45nm-5nm: data movement power (MW) across system, cabinet, boards, die, clusters, and islands at 40 Gbps photonic links @ 10 pJ/bit] bandwidth tapering and locality awareness are necessary.
Slide 40: Today's HW System Architecture
Processor: 16 cores at ~3 GHz, ~100 W; each core has an 8-wide FP unit, a small RF, 32K I$ and 32K D$, and 128K second-level I$ and D$, sharing a 16 MB L3. 384 GF/s peak, 260 pJ/FLOP (260 MW at Exa), 55 µB of local memory per FLOP.
Node: four sockets, each with 4 GB of DRAM, in a coherent domain; 1.5 TF peak, 660 pJ/FLOP (660 MW at Exa), 10 µB of DRAM per FLOP. Nodes connect through a non-coherent domain.
Today's programming model comprehends this system architecture.
Slide 41: Straw-man Exascale Processor
Simplest core: ~600K transistors (FPU, RF, logic). First level of hierarchy: eight PEs plus a service core share a 1MB L2 per cluster. Clusters connect through next-level caches and interconnect, up to a last-level cache.
Technology: 7nm, 2018 | Die area: 500 mm2 | Cores: 2048 | Frequency: 4.2 GHz | TFLOPs: 17.2 | Power: 600 W | Energy efficiency: 34 pJ/FLOP
Computation alone would consume 34 MW at Exascale.
Slide 42: Straw-man Architecture at NTV
The same processor (7nm, 500 mm2, 2048 cores), at full Vdd vs. 50% Vdd:
- Frequency: 4.2 GHz vs. 600 MHz
- TFLOPs: 17.2 vs. 2.5
- Power: 600 W vs. 37 W
- Energy efficiency: 34 pJ/FLOP vs. 15 pJ/FLOP
Reduced frequency and FLOPs, but reduced power and improved energy efficiency: compute energy efficiency close to the Exascale goal.
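The table's efficiency entries follow directly from the power and throughput columns, since 1 W per TFLOP/s is exactly 1 pJ per FLOP:

```python
def pj_per_flop(watts: float, tflops: float) -> float:
    """1 W / (1 TFLOP/s) = 1e-12 J/FLOP = 1 pJ/FLOP."""
    return watts / tflops

full_vdd = pj_per_flop(600, 17.2)  # ~34.9, the table's 34 pJ/FLOP entry
ntv      = pj_per_flop(37, 2.5)    # ~14.8, the table's 15 pJ/FLOP entry

# An exaflop built from such NTV chips: 1e18 / 2.5e12 = 400,000 chips
ntv_compute_mw = 1e18 / 2.5e12 * 37 / 1e6   # ~14.8 MW for compute alone
```

The catch is the chip count: running at NTV needs roughly 7X more silicon for the same aggregate FLOPs, which is the over-provisioning theme of the later slides.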
Slide 43: Local Memory Capacity
[Chart: micro-bytes per FLOP across the memory hierarchy (L1-L4) for a traditional system, the Exa straw-man, and the straw-man at NTV.]
Higher local memory capacity promotes data locality.
Slide 44: Interconnect Structures
- Shared bus: 1 to 10 fJ/bit, 0 to 5 mm, limited scalability
- Multi-ported shared memory: 10 to 100 fJ/bit, 1 to 5 mm, limited scalability
- Crossbar switch: 0.1 to 1 pJ/bit, 2 to 10 mm, moderate scalability
- Packet-switched network: 1 to 3 pJ/bit, >5 mm, scalable
[Diagram: boards and cabinets connected through first- and second-level cluster switches up to the system level.]
Slide 45: SW Challenges
1. Extreme parallelism (1000X due to Exa, an additional 4X due to NTV)
2. Data locality: reduce data movement
3. Intelligent scheduling: move the thread to the data if necessary
4. Fine-grain resource management (objective function)
5. Applications and algorithms must incorporate the paradigm change
These challenges span both the programming model and the execution model.
Slide 46: Programming & Execution Model
Event-driven tasks (EDT): dataflow-inspired, tiny, self-contained codelets; non-blocking, no preemption.
Programming model:
- Separation of concerns: domain specification & HW mapping
- Express data locality with hierarchical tiling
- Global, shared, non-coherent address space
- Optimization and auto-generation of EDTs (HW specific)
Execution model:
- Dynamic, event-driven scheduling; non-blocking
- Dynamic decision to move computation to data
- Observation-based adaptation (self-awareness)
- Implemented in the runtime environment
- Separation of concerns: user application, control, and resource management
- Over-provisioning, introspection, self-awareness
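The EDT idea can be sketched as a toy runtime (illustrative only, not the actual UHPC software stack): codelets run to completion without blocking or preemption, and become runnable only when all of their input events have fired.

```python
from collections import deque

class Codelet:
    """A tiny, self-contained task that waits on named input events."""
    def __init__(self, name, deps, fn):
        self.name, self.pending, self.fn = name, set(deps), fn

def run_edts(codelets):
    """Dispatch codelets whose dependencies are all satisfied."""
    done, order = set(), []
    ready = deque(c for c in codelets if not c.pending)
    waiting = [c for c in codelets if c.pending]
    while ready:
        c = ready.popleft()
        c.fn()                          # runs to completion, no preemption
        done.add(c.name); order.append(c.name)
        still = []
        for w in waiting:               # completion fires events to dependents
            w.pending -= done
            (ready if not w.pending else still).append(w)
        waiting = still
    return order

results = {}
tasks = [
    Codelet("load",   [],         lambda: results.setdefault("x", 2)),
    Codelet("square", ["load"],   lambda: results.setdefault("y", results["x"] ** 2)),
    Codelet("store",  ["square"], lambda: results.setdefault("out", results["y"])),
]
order = run_edts(tasks)
```

A real runtime would add what the slide lists: hardware-specific EDT generation, the decision to move computation to data, and observation-based adaptation.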
Slide 47: Addressing Variations
[Diagram: a processor chip of 16 clusters (PEs, service cores, 1MB L2s, 8MB shared LLCs, a 64MB shared LLC, and interconnect) with a mix of fast (F), medium (M), and slow (S) cores due to variation.]
Addressing variations:
1. Provide more compute HW
2. Rely on the law of large numbers
3. Use static profiles
System SW implements an introspective execution model:
1. Schedule threads based on objectives and resources
2. Dynamically control and manage resources
3. Identify sensors and functions in HW for implementation
Dynamic reconfiguration targets (1) energy efficiency, (2) latency, and (3) dynamic, fine-grain resource management.
Sensors for introspection: energy consumption, instantaneous power, computations, data movement.
Slide 48: Over-provisioned, Introspectively Resource-Managed System
Over-provisioned in design, dynamically tuned for the given objective.
[Chart, 45nm-5nm: system power (MW) with and without data movement vs. the 20 MW goal.]
[Chart, 45nm-5nm: data movement power (MW) across system, cabinet, boards, die, clusters, and islands at 40 Gbps photonic links @ 10 pJ/bit; 3.23 MW total.]