Slide 1: Exascale Computing—a Fact or a Fiction?
Shekhar Borkar, Intel Corp.
IPDPS 2013, May 21, 2013
This research was, in part, funded by the U.S. Government (DOE and DARPA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Slide 2: Outline
- Compute roadmap & technology outlook
- Challenges & solutions for: compute, memory, interconnect, resiliency, and the software stack
- Summary
Slide 3: Compute Performance Roadmap
[Chart: GFLOPs vs. year, 1960-2020, log scale. Supercomputer performance climbs from Mega through Giga, Tera, and Peta toward Exa, with roughly 12, 11, and 10 years between successive 1000X milestones; client and hand-held curves track below.]
Slide 4: From Giga to Exa, via Tera & Peta
[Three charts, 1986-2016: relative processor frequency (30X from Giga to Tera, 250X to Peta); processor performance and concurrency (36X to Tera, 4,000X to Peta, 2.5M X to Exa); power (80X to Tera, 4,000X to Peta, 1M X to Exa).]
System performance increases faster than single-processor performance.
Parallelism continues to increase.
The power and energy challenge continues.
Slide 5: Where is the Energy Consumed?
Teraflop system today, ~3 KW total:
- Compute: 100 pJ per FLOP, 100 W
- Memory: 0.1 B/FLOP @ 1.5 nJ per byte, 150 W
- Communication: 100 pJ of communication per FLOP, 100 W
- Disk: 10 TB @ 1 TB/disk @ 10 W, 100 W
- The remaining ~2.5 KW: decode and control, address translations, power supply losses, and architecture bloated with inefficient features
Goal: ~20 W for the same teraflop (5 W, 2 W, ~5 W, ~3 W, and 5 W across compute, memory, communication, disk, and overhead).
Slide 6: The UHPC* Challenge
20 pJ per operation across the whole range: 20 µW at Mega, 20 mW at Giga, 2 W at 100 Giga, 20 W at Tera, 20 KW at Peta, and 20 MW at Exa.
*DARPA Ubiquitous HPC Program
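The slide's milestones are straight arithmetic: power is energy per operation times operations per second. A minimal sketch of that calculation:

```python
def power_watts(pj_per_op: float, ops_per_sec: float) -> float:
    """Total power = energy per operation x operations per second."""
    return pj_per_op * 1e-12 * ops_per_sec

# At the UHPC target of 20 pJ/operation:
giga = power_watts(20, 1e9)    # ~0.02 W (20 mW)
tera = power_watts(20, 1e12)   # ~20 W
peta = power_watts(20, 1e15)   # ~20 KW
exa  = power_watts(20, 1e18)   # ~20 MW
```

The same formula shows why 100 pJ/FLOP (slide 5) puts an exascale machine at 100 MW before memory and interconnect are even counted.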
Slide 7: Technology Scaling Outlook
[Four charts across the 45nm-5nm nodes: relative transistor density continues at 1.75-2X per generation; relative frequency is almost flat; relative supply voltage is almost flat; relative energy shows some scaling but falls short of the ideal curve.]
Slide 8: Energy per Compute Operation
[Chart, 45nm-7nm: energy (pJ) for a DP floating-point op, DP register-file operand access, DRAM access (pJ/bit), and communication (pJ/bit). Callouts mark 100, 75, 25, and 10 pJ/bit points. Source: Intel]
Slide 9: Voltage Scaling
[Chart: normalized frequency, total power, leakage, and energy efficiency vs. supply voltage (0.3-0.9 of nominal Vdd). Energy efficiency rises as Vdd drops, when the design is built to voltage scale.]
Slide 10: Near Threshold-Voltage (NTV)
[Two charts, 65nm CMOS at 50°C (H. Kaul et al., 16.6, ISSCC 2008): maximum frequency (MHz) and total power (mW) vs. supply voltage (0.2-1.4 V), with the subthreshold region below 320 mV. Frequency drops about 4 orders of magnitude while total power drops less than 3 orders. Energy efficiency (GOPS/Watt) peaks near threshold, a 9.6X improvement over nominal; active leakage power (mW) is also shown.]
Slide 11: Subthreshold Leakage at NTV
[Chart, 45nm-5nm: source-drain leakage as a fraction of total power (0-60%) at 100%, 75%, 50%, and 40% of Vdd; the fraction grows at lower voltage, and variations increase.]
NTV operation reduces total power and improves energy efficiency, but subthreshold leakage power becomes a substantial portion of the total.
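The NTV sweet spot can be illustrated with a toy model (all constants here are illustrative assumptions, not the measured 65nm data above): frequency follows a simplified alpha-power law above threshold, dynamic power scales as CV²f, and a crude linear leakage term is added.

```python
VT = 0.32       # threshold voltage, matching the 320 mV point above (V)
VDD_NOM = 1.2   # nominal supply (V)

def rel_frequency(v: float, alpha: float = 1.5) -> float:
    """Frequency relative to nominal, ~ (V - Vt)^alpha / V."""
    return ((v - VT) ** alpha / v) / ((VDD_NOM - VT) ** alpha / VDD_NOM)

def rel_power(v: float, leak_frac: float = 0.1) -> float:
    """Power relative to nominal: dynamic CV^2f plus a crude leakage term."""
    dyn = (v / VDD_NOM) ** 2 * rel_frequency(v)
    leak = leak_frac * (v / VDD_NOM)  # illustrative linear leakage model
    return dyn + leak

def efficiency(v: float) -> float:
    """Throughput per watt, relative to nominal Vdd."""
    return rel_frequency(v) / rel_power(v)

# Sweep Vdd from 0.4 V to 1.2 V: the efficiency peak lands near threshold
best_v = max((0.4 + 0.05 * i for i in range(17)), key=efficiency)
```

Even this toy model reproduces the qualitative picture: efficiency several times better near threshold than at full Vdd, limited below that by leakage that no longer amortizes over useful work.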
Slide 12: Mitigating Impact of Variation
1. Variation control with body biasing:
- The body effect is substantially reduced in advanced technologies
- The energy cost of body biasing could become substantial
- Fully-depleted transistors have no body left to bias
2. Variation tolerance at the system level:
[Diagram: a many-core system with cores running at their native frequencies f, f/2, and f/4 rather than one worst-case frequency.]
- Running all cores at full frequency exceeds the energy budget
- Run each core at its native frequency; the law of large numbers averages out the variation
Slide 13: Experimental NTV Processor
[Die photo: 1.1 mm x 1.8 mm IA-32 core with logic, scan, ROM, L1 I$ and D$, and level shifters plus clock spine; mounted on a custom interposer in a 951-pin FCBGA package on a legacy Socket-7 motherboard.]
Technology: 32nm High-K Metal Gate
Interconnect: 1 poly, 9 metal (Cu)
Transistors: 6 million (core)
Core area: 2 mm2
S. Jain et al., "A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS", ISSCC 2012
Slide 14: Wide Dynamic Range
[Chart: energy efficiency vs. voltage, from subthreshold through the normal operating range; ~5X improvement demonstrated at NTV.]
- Ultra-low power (subthreshold): 280 mV, 3 MHz, 2 mW, 1500 Mips/W
- Energy efficient (NTV): 0.45 V, 60 MHz, 10 mW, 5830 Mips/W
- High performance: 1.2 V, 915 MHz, 737 mW, 1240 Mips/W
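The three operating points are internally consistent: Mips/W is throughput over power, and assuming roughly one instruction per cycle the reported frequencies and powers reproduce the reported efficiencies.

```python
def mips_per_watt(freq_mhz: float, power_mw: float, ipc: float = 1.0) -> float:
    """Efficiency = instruction throughput (Mips) / power (W).
    ipc ~ 1.0 is an assumption for this simple in-order core."""
    return freq_mhz * ipc / (power_mw * 1e-3)

subthreshold = mips_per_watt(3, 2)      # ~1500  (reported: 1500)
ntv          = mips_per_watt(60, 10)    # ~6000  (reported: 5830)
full_vdd     = mips_per_watt(915, 737)  # ~1241  (reported: 1240)
```

The NTV point is roughly 4.7X more efficient than full Vdd, which is the "~5X demonstrated" callout on the chart.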
Slide 15: Observations
[Chart: power breakdown into memory leakage, memory dynamic, logic leakage, and logic dynamic at subthreshold, NTV, and full Vdd.]
Leakage power dominates at low voltage; fine-grain leakage power management is required.
Slide 16: Integration of Power Delivery
Move power delivery closer to the load, for efficiency and management.
[Photos: integrated voltage regulator testchip in standard OLGA packaging; converter chip and load chip on top, air-core inductors and input/output capacitors on the bottom, 5 mm RF launch.]
[Chart: conversion efficiency (70-90%) vs. load current (0-20 A) for 2.4V-to-1.5V and 2.4V-to-1.2V conversion at 60-100 MHz, with L = 1.9 nH and 0.8 nH.]
Schrom et al., "A 100MHz 8-Phase Buck Converter Delivering 12A in 25mm2 Using Air-Core Inductors", APEC 2007
Power delivery close to the load enables (1) improved efficiency and (2) fine-grain power management.
Slide 17: Fine-grain Power Management
Dynamic, chip level:
- STANDBY: memory retains data, 50% less power per tile
- FULL SLEEP: memories fully off, 80% less power per tile
Dynamic, within a core (21 sleep regions per tile):
- FP engine 1 sleeping: 90% less power
- FP engine 2 sleeping: 90% less power
- Router sleeping: 10% less power (stays on to pass traffic)
- Data memory sleeping: 57% less power
- Instruction memory sleeping: 56% less power
Mode / state / power saving / wake-up:
- Normal: all active / - / -
- Standby: logic off, memory on / 50% / fast
- Sleep: logic and memory off / 80% / slow
Energy efficiency increases by 60%.
Vangal et al., "An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS", JSSC, January 2008
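The chip-level accounting behind such tables is simple bookkeeping over per-tile power states. A sketch, assuming a hypothetical 2 W nominal tile (the paper reports whole-chip figures, not this per-tile number); the savings fractions are the ones on the slide.

```python
TILE_NOMINAL_W = 2.0  # illustrative assumption, not a reported figure

SAVINGS = {"normal": 0.0, "standby": 0.50, "sleep": 0.80}

def tile_power(mode: str) -> float:
    """Per-tile power after applying the mode's saving fraction."""
    return TILE_NOMINAL_W * (1.0 - SAVINGS[mode])

def chip_power(mode_counts: dict) -> float:
    """Total chip power for tiles spread across power states."""
    return sum(tile_power(mode) * count for mode, count in mode_counts.items())

# 80 tiles: 40 active, 30 in standby, 10 fully asleep
mixed  = chip_power({"normal": 40, "standby": 30, "sleep": 10})  # 114 W
all_on = chip_power({"normal": 80})                              # 160 W
```

The trade is visible in the table's wake-up column: deeper states save more power per tile but cost a slower wake-up, which is why the runtime must pick states per tile rather than globally.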
Slide 18: Memory & Storage Technologies
[Chart, log scale 0.01-100,000: energy/bit (picojoules), capacity (Gbit), and cost/bit (pico-dollars) for SRAM, DRAM, NAND/PCM (endurance issue), and disk. Source: Intel]
Slide 19: Compare Memory Technologies
DRAM for first-level capacity memory; NAND/PCM for next-level storage. (Source: Intel)
Slide 20: Revise DRAM Architecture
Traditional DRAM (RAS/CAS):
- Activates many pages
- Lots of reads and writes (refresh)
- Only a small amount of the read data is used
- Requires a small number of pins
New DRAM architecture (direct addressing):
- Activates few pages
- Reads and writes (refreshes) only what is needed
- All read data is used
- Requires a large number of IOs (3D)
[Chart, 90nm-7nm: bandwidth demand (GB/sec) must increase exponentially while energy (pJ/bit) must decrease exponentially.]
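The "all read data is used" point can be made concrete with a rough useful-data model (page size, page count, and request size below are illustrative assumptions, not specifics from the slide):

```python
def activation_efficiency(bytes_used: int, pages_activated: int,
                          page_bytes: int = 1024) -> float:
    """Fraction of the data pulled out of the arrays that is actually used."""
    return bytes_used / (pages_activated * page_bytes)

# Traditional DRAM: a 64-byte cache-line fetch activates a wide multi-page row
traditional = activation_efficiency(bytes_used=64, pages_activated=8)

# Revised architecture: activate only the one page holding the line
revised = activation_efficiency(bytes_used=64, pages_activated=1)
```

Under these assumptions less than 1% of the activated bits are useful in the traditional case, which is why narrowing activation (and paying for it with many 3D IOs) cuts pJ/bit.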
Slide 21: 3D-Integration of DRAM and Logic
[Diagram: a DRAM stack over a logic buffer chip in one package.]
- Logic buffer chip: technology optimized for high-speed signaling, energy-efficient logic circuits, and implementing intelligence
- DRAM stack: technology optimized for memory density and lower cost
3D integration provides the best of both worlds.
Slide 22: 1Tb/s HMC DRAM Prototype
- 3D integration technology
- 1Gb DRAM array, 512 MB total DRAM per cube
- 128 GB/s bandwidth, <10 pJ/bit energy
DDR3 (today): 10.66 GB/sec at 50-75 pJ/bit. Hybrid Memory Cube: 128 GB/sec at 8 pJ/bit.
Roughly 10X higher bandwidth at 10X lower energy. (Source: Micron)
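Worth noting what the table implies for power: access power at full bandwidth is bandwidth times energy per bit, so the HMC delivers its ~12X bandwidth advantage for only a modest power increase.

```python
def memory_power_watts(gb_per_sec: float, pj_per_bit: float) -> float:
    """Interface power at full bandwidth = bytes/s x 8 bits x J/bit."""
    return gb_per_sec * 1e9 * 8 * pj_per_bit * 1e-12

ddr3 = memory_power_watts(10.66, 60)   # ~5.1 W at the mid-range of 50-75 pJ/bit
hmc  = memory_power_watts(128, 8)      # ~8.2 W
bw_ratio = 128 / 10.66                 # ~12X more bandwidth
```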
Slide 23: Communication Energy
[Chart, log-log: energy per bit (0.01-100 pJ/bit) vs. interconnect distance (0.1-1000 cm) for on-die, chip-to-chip, board-to-board, and between-cabinet links.]
Slide 24: On-die Interconnect
[Chart, 90nm-7nm: relative compute energy scales down faster than on-die interconnect energy. Source: Intel]
Interconnect energy (per mm) reduces more slowly than compute energy, so on-die data movement energy will start to dominate.
Slide 25: Network On Chip (NoC)
80-Core TFLOP Chip (2006):
[Die photo: 12.64 mm x 21.72 mm, single tile 1.5 mm x 2.0 mm, with I/O areas, PLL, and TAP.]
- Power breakdown: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%
- 8 x 10 mesh, 32-bit links, 320 GB/sec bisection bandwidth @ 5 GHz
48-Core Single-Chip Cloud (2009):
[Die photo: 26.5 mm x 21.4 mm, with four DDR3 MCs, VRC, PLL, JTAG, tiles, and system interface + I/O.]
- Power breakdown: cores 70%, MC & DDR3-800 19%, routers & 2D mesh 10%, global clocking 1%
- 2-core clusters in a 6 x 4 mesh (why not 6 x 8?), 128-bit links, 256 GB/sec bisection bandwidth @ 2 GHz
Slide 26: On-chip Interconnects
[Chart: delay (100-100,000 ps, log scale) and energy (0-0.9 pJ/bit) vs. wire length (0-25 mm) at 0.3µ pitch and 0.5 V, comparing raw wire delay, repeated wire delay, router delay, wire energy, and router energy.]
Slide 27: Circuit Switched NoC
[Die/diagram: 8x8 NoC with clock I/O, traffic generation and measurement, and ports to core, north, south, east, and west.]
- Process: 45nm Hi-K/MG CMOS, 9 metal (Cu); nominal supply 1.1 V
- Arbitration and router logic supports 512b data; 2.85M transistors; 6.25 mm2 die area
- A narrow, high-frequency packet-switched network establishes a circuit
- Data is then transferred over the wide, slower circuit-switched bus
- A differential, low-swing bus improves energy efficiency
2 to 3X higher energy efficiency than a packet-switched network.
Anders et al., "A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8x8 mesh network-on-chip in 45nm CMOS", ISSCC 2010
Slide 28: Hierarchical & Heterogeneous
[Diagram: clusters of cores on local buses, joined by a second-level bus.]
Buses connect cores over short distances; a hierarchy of buses, or hierarchical circuit- and packet-switched networks, connects clusters.
Slide 29: Electrical Interconnect < 1 Meter
[Chart, process generations 1.2µ-32nm (source: ISSCC papers): link energy (pJ/bit) falls and data rate (Gb/sec) rises with each generation.]
Bandwidth and energy efficiency improve, but not enough.
Slide 30: Electrical Interconnect Advances
Employ new, low-loss, non-traditional interconnects: low-loss flex connectors, low-loss twinax, top-of-the-package connectors.
Co-optimize interconnects and circuits for energy efficiency.
[Chart: energy (pJ/bit) vs. channel length (0-400 cm) for HDI, flex, and twinax against the state of the art; eye diagrams shown.]
O'Mahony et al., "A 47x10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS", ISSCC 2010; and J. Jaussi, RESS004, IDF 2010
Slide 31: Optical Interconnect > 1 Meter
[Two charts: energy (pJ/bit) vs. laser efficiency (1%, 10%, 20%) at 100% link utilization, and energy (pJ/bit) vs. link utilization (100%, 50%, 10%) at each laser efficiency.]
- Energy in the supporting electronics is very low
- Link energy is dominated by the laser (its efficiency)
- Sustained, high link utilization is required
Source: PETE study group
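The shape of both charts follows from one relationship: the laser burns power whether or not data flows, so its energy is amortized only over the bits actually sent. A rough model (the per-bit constants below are illustrative assumptions, not the PETE group's figures):

```python
def optical_pj_per_bit(laser_eff: float, utilization: float,
                       optical_pj: float = 0.1,
                       electronics_pj: float = 0.2) -> float:
    """Effective pJ/bit: wall-plug laser energy amortized over used bits,
    plus the (small) supporting-electronics energy."""
    return optical_pj / (laser_eff * utilization) + electronics_pj

# 10% laser efficiency, fully utilized link:
busy = optical_pj_per_bit(0.10, 1.0)   # ~1.2 pJ/bit
# The same link at 10% utilization:
idle = optical_pj_per_bit(0.10, 0.1)   # ~10.2 pJ/bit
```

Hence the slide's conclusion: improving laser efficiency and keeping links busy matter far more than shaving the electronics.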
Slide 32: Straw-man System Interconnect
[Diagram: ~0.8 PF nodes grouped into clusters of 35 with 1,000 fibers each; 35 clusters with 8,000 fibers each.]
Assume 40 Gbps links, 10 pJ/bit, $0.6/Gbps, 8 B/FLOP, and naive tapering: $35M in optics and 217 MW.
Slide 33: Bandwidth Tapering
[Chart: Byte/FLOP at each level (Core, L1, L2, Chip, Board, Cab, L1 Sys, L2 Sys) for 8 Byte/FLOP total: naive 4X tapering gives 24, 6.65, 1.13, 0.19, 0.03, 0.0045, 0.0005, 0.00005, vs. a severe tapering schedule.]
[Charts: data movement power by level. Naive tapering totals 217 MW; severe tapering totals 3 MW.]
Intelligent bandwidth tapering is necessary.
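The per-level arithmetic that drives these totals: data movement power at a level is FLOP rate times bytes per FLOP times 8 bits times energy per bit. A sketch at the straw-man's 10 pJ/bit photonic links (level-by-level energies differ in practice, so this is a bound, not a reconstruction of the 217 MW chart):

```python
EXAFLOP = 1e18  # FLOP/s

def dm_power_mw(byte_per_flop: float, pj_per_bit: float) -> float:
    """Data movement power (MW) for one hierarchy level at exascale."""
    return EXAFLOP * byte_per_flop * 8 * pj_per_bit * 1e-12 / 1e6

# Sustaining the full 8 Byte/FLOP at system scale over photonic links:
untapered = dm_power_mw(8.0, 10)       # 640 MW: untenable
# The severely tapered system level from the chart:
severe = dm_power_mw(0.00005, 10)      # 0.004 MW
```

Tapering works only if applications keep their traffic local, which is why the software-stack slides later insist on locality-aware scheduling.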
Slide 34: Road to Unreliability?
From Peta to Exa, reliability issues compound:
- 1,000X parallelism: more hardware for something to go wrong; >1,000X intermittent faults due to soft errors
- Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptibility to Vcc droops (noise) and dynamic temperature variations; exacerbated intermittent faults (soft errors)
- Deeply scaled technologies: aging-related faults; lack of burn-in?; variability increases dramatically
Resiliency will be the cornerstone.
Slide 35: Soft Errors and Reliability
[Chart: normalized SER per cell at sea level vs. voltage (0.5-2 V) for 250nm down to 65nm. The soft error rate per bit drops each generation, and NTV has only a nominal impact on it.]
[Chart, 180nm-32nm: memory and latch SER relative to 130nm, assuming a 2X bit/latch count increase per generation. Soft errors at the system level will continue to increase.]
NTV also has positive impacts on reliability: lower voltage means lower electric fields and lower power means lower temperature, so device aging effects will be less of a concern and electromigration-related defects are reduced.
N. Seifert et al., "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006
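The system-level trend is the product of two opposing factors: per-bit SER improves each generation while the bit and latch count doubles. A sketch (the 0.8X per-generation per-bit improvement is an illustrative assumption; the 2X bit growth is the slide's):

```python
def relative_system_ser(generations: int, per_bit_scaling: float = 0.8,
                        bits_scaling: float = 2.0) -> float:
    """System-level SER relative to the starting node after n generations."""
    return (per_bit_scaling * bits_scaling) ** generations

# Five generations (e.g. 130nm down to 32nm):
trend = relative_system_ser(5)   # 1.6^5, roughly 10X worse at system level
```

As long as per-bit improvement is weaker than 2X per generation, the product grows, which is the slide's point.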
Slide 36: Resiliency
Faults, with examples:
- Permanent faults: stuck-at 0 & 1
- Gradual faults: variability, temperature
- Intermittent faults: soft errors, voltage droops
- Aging faults: degradation
Faults cause errors (data & control):
- Datapath errors: detected by parity/ECC
- Silent data corruption: needs HW hooks
- Control errors: control lost (blue screen)
Resiliency spans the whole stack: circuit & design, microarchitecture, microcode & platform, system software, programming system, and applications together provide error detection, fault isolation, fault confinement, reconfiguration, and recovery & adaptation, with minimal overhead.
Slide 37: Architecture Needs a Paradigm Shift
Must revisit and evaluate each (even legacy) architecture feature.
Architects' past and present priorities: single-thread performance (frequency), programming productivity (legacy, compatibility), and architecture features for productivity, constrained by (1) cost and (2) reasonable power/energy.
Architects' future priorities should be: throughput performance (parallelism, application-specific HW) and power/energy (architecture features for energy, simplicity), constrained by (1) programming productivity and (2) cost.
Slide 38: HW-SW Co-design
Applications and the SW stack provide top-down guidance for efficient system design; circuits & design, architecture, programming system, and system SW feed limitations, issues, and opportunities to exploit back up the stack.
Slide 39: Bottom-up Guidance
1. NTV reduces energy but exacerbates variations: small & fast cores, randomly distributed, temperature dependent.
2. Limited NTV for arrays (memory) due to stability issues: memory voltage scales disproportionately to compute, but memory arrays can be made larger.
3. On-die interconnect energy (per mm) does not reduce as much as compute energy: [chart, 45nm-7nm] 6X for compute vs. 1.6X for interconnect.
4. At NTV, leakage power is a substantial portion of total power: [chart, 45nm-5nm: SD leakage fraction at 100%, 75%, 50%, and 40% Vdd, with increasing variations] expect 50% leakage; idle hardware consumes energy.
5. DRAM energy scales, but not enough: 50 pJ/bit today, 8 pJ/bit demonstrated (3D Hybrid Memory Cube), need < 2 pJ/bit.
6. System interconnect is limited by laser energy and cost: [chart, 45nm-5nm: data movement power (MW) across system, cabinet, boards, die, clusters, and islands at 40 Gbps photonic links @ 10 pJ/bit] bandwidth tapering and locality awareness are necessary.
Slide 40: Today's HW System Architecture
Processor: 16 cores at ~3 GHz, ~100 W; each core has an 8-wide FP unit, a small RF, 32K I$ and 32K D$, and 128K second-level I$ and D$, sharing a 16 MB L3. 384 GF/s peak, 260 pJ/FLOP (260 MW at Exa), 55 µB of local memory per FLOP.
Node: four sockets, each with 4 GB of DRAM, in a coherent domain; 1.5 TF peak, 660 pJ/FLOP (660 MW at Exa), 10 µB of DRAM per FLOP. Nodes connect through a non-coherent domain.
Today's programming model comprehends this system architecture.
Slide 41: Straw-man Exascale Processor
Simplest core: ~600K transistors (FPU, RF, logic). First level of hierarchy: eight PEs plus a service core share a 1MB L2 per cluster. Clusters connect through next-level caches and interconnect, up to a last-level cache.
Technology: 7nm, 2018 | Die area: 500 mm2 | Cores: 2048 | Frequency: 4.2 GHz | TFLOPs: 17.2 | Power: 600 W | Energy efficiency: 34 pJ/FLOP
Computation alone would consume 34 MW at Exascale.
Slide 42: Straw-man Architecture at NTV
The same processor (7nm, 500 mm2, 2048 cores), at full Vdd vs. 50% Vdd:
- Frequency: 4.2 GHz vs. 600 MHz
- TFLOPs: 17.2 vs. 2.5
- Power: 600 W vs. 37 W
- Energy efficiency: 34 pJ/FLOP vs. 15 pJ/FLOP
Reduced frequency and FLOPs, but reduced power and improved energy efficiency: compute energy efficiency close to the Exascale goal.
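The table's efficiency entries follow directly from the power and throughput columns, since 1 W per TFLOP/s is exactly 1 pJ per FLOP:

```python
def pj_per_flop(watts: float, tflops: float) -> float:
    """1 W / (1 TFLOP/s) = 1e-12 J/FLOP = 1 pJ/FLOP."""
    return watts / tflops

full_vdd = pj_per_flop(600, 17.2)  # ~34.9, the table's 34 pJ/FLOP entry
ntv      = pj_per_flop(37, 2.5)    # ~14.8, the table's 15 pJ/FLOP entry

# An exaflop built from such NTV chips: 1e18 / 2.5e12 = 400,000 chips
ntv_compute_mw = 1e18 / 2.5e12 * 37 / 1e6   # ~14.8 MW for compute alone
```

The catch is the chip count: running at NTV needs roughly 7X more silicon for the same aggregate FLOPs, which is the over-provisioning theme of the later slides.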
Slide 43: Local Memory Capacity
[Chart: micro-bytes per FLOP across the memory hierarchy (L1-L4) for a traditional system, the Exa straw-man, and the straw-man at NTV.]
Higher local memory capacity promotes data locality.
Slide 44: Interconnect Structures
- Shared bus: 1 to 10 fJ/bit, 0 to 5 mm, limited scalability
- Multi-ported shared memory: 10 to 100 fJ/bit, 1 to 5 mm, limited scalability
- Crossbar switch: 0.1 to 1 pJ/bit, 2 to 10 mm, moderate scalability
- Packet-switched network: 1 to 3 pJ/bit, >5 mm, scalable
[Diagram: boards and cabinets connected through first- and second-level cluster switches up to the system level.]
Slide 45: SW Challenges
1. Extreme parallelism (1000X due to Exa, an additional 4X due to NTV)
2. Data locality: reduce data movement
3. Intelligent scheduling: move the thread to the data if necessary
4. Fine-grain resource management (objective function)
5. Applications and algorithms must incorporate the paradigm change
These challenges span both the programming model and the execution model.
Slide 46: Programming & Execution Model
Event-driven tasks (EDT): dataflow-inspired, tiny, self-contained codelets; non-blocking, no preemption.
Programming model:
- Separation of concerns: domain specification & HW mapping
- Express data locality with hierarchical tiling
- Global, shared, non-coherent address space
- Optimization and auto-generation of EDTs (HW specific)
Execution model:
- Dynamic, event-driven scheduling; non-blocking
- Dynamic decision to move computation to data
- Observation-based adaptation (self-awareness)
- Implemented in the runtime environment
- Separation of concerns: user application, control, and resource management
- Over-provisioning, introspection, self-awareness
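The EDT idea can be sketched as a toy runtime (illustrative only, not the actual UHPC software stack): codelets run to completion without blocking or preemption, and become runnable only when all of their input events have fired.

```python
from collections import deque

class Codelet:
    """A tiny, self-contained task that waits on named input events."""
    def __init__(self, name, deps, fn):
        self.name, self.pending, self.fn = name, set(deps), fn

def run_edts(codelets):
    """Dispatch codelets whose dependencies are all satisfied."""
    done, order = set(), []
    ready = deque(c for c in codelets if not c.pending)
    waiting = [c for c in codelets if c.pending]
    while ready:
        c = ready.popleft()
        c.fn()                          # runs to completion, no preemption
        done.add(c.name); order.append(c.name)
        still = []
        for w in waiting:               # completion fires events to dependents
            w.pending -= done
            (ready if not w.pending else still).append(w)
        waiting = still
    return order

results = {}
tasks = [
    Codelet("load",   [],         lambda: results.setdefault("x", 2)),
    Codelet("square", ["load"],   lambda: results.setdefault("y", results["x"] ** 2)),
    Codelet("store",  ["square"], lambda: results.setdefault("out", results["y"])),
]
order = run_edts(tasks)
```

A real runtime would add what the slide lists: hardware-specific EDT generation, the decision to move computation to data, and observation-based adaptation.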
Slide 47: Addressing Variations
[Diagram: a processor chip of 16 clusters (PEs, service cores, 1MB L2s, 8MB shared LLCs, a 64MB shared LLC, and interconnect) with a mix of fast (F), medium (M), and slow (S) cores due to variation.]
Addressing variations:
1. Provide more compute HW
2. Rely on the law of large numbers
3. Use static profiles
System SW implements an introspective execution model:
1. Schedule threads based on objectives and resources
2. Dynamically control and manage resources
3. Identify sensors and functions in HW for implementation
Dynamic reconfiguration targets (1) energy efficiency, (2) latency, and (3) dynamic, fine-grain resource management.
Sensors for introspection: energy consumption, instantaneous power, computations, data movement.
Slide 48: Over-provisioned, Introspectively Resource-Managed System
Over-provisioned in design, dynamically tuned for the given objective.
[Chart, 45nm-5nm: system power (MW) with and without data movement vs. the 20 MW goal.]
[Chart, 45nm-5nm: data movement power (MW) across system, cabinet, boards, die, clusters, and islands at 40 Gbps photonic links @ 10 pJ/bit; 3.23 MW total.]