Memory Architectures & Technologies for
Petascale Computing: Challenges and Opportunities
Fayé A. Briggs, PhD, Intel Fellow
Director of Scalable Server Architecture
Data Center Group
Intel Architecture Group
August 2011
Agenda
• Multi-Core Processor Performance Drivers
• Memory Bandwidth Trends & Issues
• Memory Technology Trends
• Summary
[Slide: Moore’s Law and system growth drive petascale workloads: life sciences/genomics, multi-physics CAD & manufacturing, climate modeling & weather prediction, research & analytics, energy & oil exploration, digital content creation, astrophysics]
Micro-architecture Performance Parameters
• Core capability (IPC)
• Core-count
• On-die Interconnect
• Frequency
• Cache Size
• Memory Latency
• Memory bandwidth & Size
• Memory Technologies
Many Knobs for Optimizing Performance within Power & Cost Constraints
Performance = Frequency × Instructions Per Cycle (IPC)
Power ∝ V² × Frequency
Frequency ∝ Voltage, so reducing frequency together with voltage yields a roughly cubic reduction in power.
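To make that scaling concrete, here is a minimal sketch (assuming dynamic power ∝ V²·f and f ∝ V; the frequency points are illustrative, not measured values):

```python
# Sketch: dynamic power scaling when voltage and frequency are reduced together.
# Assumes P ~ V^2 * f and f ~ V, so P ~ f^3 (illustrative numbers only).

def relative_power(freq_scale):
    """Relative dynamic power when frequency (and hence voltage) is scaled."""
    voltage_scale = freq_scale               # f is proportional to V
    return voltage_scale ** 2 * freq_scale   # P is proportional to V^2 * f

for f in (1.0, 0.9, 0.8, 0.5):
    print(f"frequency x{f:.1f} -> power x{relative_power(f):.3f}")
# frequency x0.8 -> power x0.512: a 20% frequency cut roughly halves dynamic power,
# which is why many slower cores can beat one fast core within a fixed power budget.
```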
Cores & Uncores: Performance Challenges
[Diagram: Nehalem architecture: cores plus uncore (last-level cache, memory controller, QPI links, power & clock), the on-die interconnect between cache, cores, and memory, and DRAM attached to the integrated memory controller]
[Diagram: platform examples: a two-socket Nehalem-EP system with Tylersburg-EP IOH connected by Intel® QPI, per-socket memory, PCI Express* Gen2, and ICH10; and a four-socket NHM-EX configuration]
3-level Inclusive Cache Hierarchy
• 1st level caches
– 32 kB Instruction Cache
– 32 kB Data Cache
– Support more cache misses in parallel
• 2nd level Unified Cache
– 256 kB per core
– Designed for very low latency
• 3rd level Unified Cache
– Size depends on # of cores
– Inclusive cache
– Core valid bits for minimizing snoop traffic
[Diagram: multiple cores, each with private L1 caches and a 256 kB L2 cache, sharing an inclusive L3 cache]
Why Inclusive?
• Cache acts as a snoop filter
• Only snoop cores when necessary
• Provides Scalability
• Minimizes Latency
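A toy sketch of this snoop-filtering behavior, using a hypothetical InclusiveLLC class that tracks core-valid bits per line (a simplification for illustration, not Intel's implementation):

```python
# Toy model of an inclusive LLC acting as a snoop filter with core-valid bits.
# Simplification: no coherence states, just per-line presence tracking.

class InclusiveLLC:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.lines = {}          # address -> set of core IDs that may hold the line

    def core_fill(self, core, addr):
        """A core brings a line into its private L1/L2; L3 records it (inclusion)."""
        self.lines.setdefault(addr, set()).add(core)

    def snoop_targets(self, requester, addr):
        """Cores that must be snooped for this request."""
        if addr not in self.lines:
            return []            # miss in the inclusive L3 => no core can hold it
        return [c for c in self.lines[addr] if c != requester]

llc = InclusiveLLC(num_cores=8)
llc.core_fill(core=2, addr=0x1000)
print(llc.snoop_targets(requester=0, addr=0x1000))  # [2]  -> snoop only core 2
print(llc.snoop_targets(requester=0, addr=0x2000))  # []   -> no snoop traffic at all
```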
Growing Gap In Single Thread Performance
[Chart: single-thread performance, today vs. today + 8 years, decomposed into speed (frequency: 100 MHz, 200 MHz, 450 MHz, 1.0 GHz, 3.2 GHz), process efficiency, and architectural efficiency, showing a widening gap to the historical trend line]
• Current trend assumes ~20% performance improvement tock to tock and ~30% (1.5:1) transistor growth per generation, which is getting very hard.
• To get back to the trend line: need ~62% performance improvement tock to tock.
Increasing Bandwidth to Bridge Multi-core Performance Gap
[Chart: peak memory bandwidth trend (GB/s, log scale) and processor performance (GFlops/s), 1997 to 2013, across the Pentium®, Pentium® II, Pentium® III, Pentium® 4, and Intel® Core™ microarchitectures and the SDR, DDR1, DDR2/DDR3, and DDR3/DDR4 memory generations. Source: Intel estimate of bandwidth trend.]
Even DDR4 bandwidth is not enough for Byte/Flop/sec balance.
Core Count Vs. Memory Bandwidth
Intel estimates of future trends in bandwidth capability, memory channels and core count. Intel estimates are based in part on historical Intel products and projections. Actual bandwidth capability, core count and number of channels of Intel products will vary based on actual product configurations.
[Chart: server core-to-memory ratio vs. core count (4 to 32 cores): if platforms continue with 3 DDR channels, peak bandwidth per core falls while the core-to-channel ratio climbs; a 4-channel assumption is shown for comparison]
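A back-of-envelope sketch of how per-core bandwidth and the byte/flop balance erode as core counts outgrow channel counts (the per-channel bandwidth and per-core flop rate below are assumed round numbers, not product figures):

```python
# Back-of-envelope: peak DRAM bandwidth per core and bytes/flop as core counts grow.
# Assumptions: ~10.7 GB/s per DDR3-1333 channel, 8 flops/cycle per core at 2.5 GHz.

CHANNEL_GBS = 10.7          # peak GB/s per channel (assumed)
FLOPS_PER_CORE = 8 * 2.5e9  # flops/s per core (assumed)

for cores in (4, 8, 16, 32):
    for channels in (3, 4):
        bw = channels * CHANNEL_GBS                       # GB/s per socket
        bw_per_core = bw / cores
        bytes_per_flop = bw * 1e9 / (cores * FLOPS_PER_CORE)
        print(f"{cores:2d} cores, {channels} ch: "
              f"{bw_per_core:4.1f} GB/s per core, {bytes_per_flop:.3f} B/flop")
# The bytes/flop ratio collapses as cores grow faster than channels,
# which is the balance problem the slide is pointing at.
```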
Technologies to Increase Bandwidth
[Chart: bandwidth projections, GB/s per socket, 2008 to 2017: traditional CPU and HE-WS/HPC bandwidth demand vs. the DDR3 trend, a DDR4 projection with increasing channel counts, and an eDRAM projection at 2X per 3 years CAGR. Source: Intel Forecast.]
• eDRAM: replace on-package memory controller with very fast flex links to an on-board memory controller [Diagram: CPU, memory controller + buffer, memory package]
Intel estimates of future trends in bandwidth capability. Intel estimates are based in part on historical bandwidth capability of Intel products and projections for bandwidth capability improvement. Actual bandwidth capability of Intel products will vary based on actual product configurations.
Memory Bandwidth Does Not Scale With Moore’s Law
• Significant increase in CPU computing power
• Memory BW not keeping up
• Gap must be closed to maintain system balance
Sources: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008); Intel Forecast
[Chart: bandwidth projections, GB/s per socket, 2008 to 2017: traditional CPU and HE-WS/HPC bandwidth demand vs. the DDR3 trend and a DDR4 assumption]
[Chart: relative growth from 2006 to 2018 in memory bandwidth, number of cores, and Flops/cycle/socket, with an exponential fit of the required memory bandwidth]
Workloads Matter
[Charts: workload performance vs. memory bandwidth (4, 12, and 20 channels) and vs. memory speed (1333 MHz, 2133 MHz, 3200 MHz), assuming an identical number of cores in all cases: some workloads show no change with increased bandwidth, others no change with increased latency]
A Number of Challenges to Deal With
Challenges: cost/area, power, extreme parallelism, Byte/Flop ratio.
[Chart: memory BW per socket (GB/s) and DRAM data pins per memory interface (@ 10 Gb/s) vs. TeraFlops per socket (4 to 40): keeping pin count low pushes power high, while keeping power low pushes pin count high]
Data-movement energy: ~10 pJ per byte, chip to chip ~100 pJ per byte, memory ~150 pJ per byte.
[Chart: top system concurrency trend, 1993 to 2009, rising from roughly 1E+02 to 1E+07]
[Chart: memory capacity (TB) vs. peak performance (TF) for some of the most recent Top 10 machines]
Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
A Complete Re-think of System Level Memory Architecture Is Needed
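A worked example of why those per-byte energies force the re-think, combining the slide's ~150 pJ/byte memory figure with an assumed 0.1 byte/flop balance at exascale (all numbers illustrative):

```python
# Illustrative energy cost of feeding an exascale-class machine from memory.
# Uses the slide's ~150 pJ/byte memory figure and an assumed 0.1 byte/flop balance.

PJ_PER_BYTE_MEMORY = 150e-12   # ~150 pJ per byte to/from memory (from the slide)
BYTES_PER_FLOP     = 0.1       # assumed target balance
PEAK_FLOPS         = 1e18      # 1 exaflop/s

bytes_per_sec = PEAK_FLOPS * BYTES_PER_FLOP            # 1e17 B/s
memory_power_watts = bytes_per_sec * PJ_PER_BYTE_MEMORY
print(f"Memory traffic: {bytes_per_sec / 1e15:.0f} PB/s")
print(f"Power for memory data movement alone: {memory_power_watts / 1e6:.0f} MW")
# ~15 MW just to move data to and from memory, before any computation.
```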
Consider New Levels of Memory Hierarchy
           On-CPU SRAM   DRAM      NVM Storage   Storage (HDD)
Latency    1x            ~20x      ~20,000x      ~10,000,000x
$$/bit     1x            ~0.05x    ~0.005x       ~0.0005x
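A small sketch of the effect of inserting an NVM tier between DRAM and disk, using the latency ratios above and purely illustrative hit rates:

```python
# Average memory-system latency with and without an NVM tier, using the table's
# latency ratios (SRAM = 1x) and assumed, purely illustrative hit rates.

LATENCY = {"sram": 1, "dram": 20, "nvm": 20_000, "hdd": 10_000_000}  # relative units

def avg_latency(tiers):
    """tiers: list of (name, hit_rate) checked in order; the last tier catches the rest."""
    total, remaining = 0.0, 1.0
    for name, hit in tiers[:-1]:
        total += remaining * hit * LATENCY[name]
        remaining *= (1 - hit)
    total += remaining * LATENCY[tiers[-1][0]]
    return total

without_nvm = avg_latency([("sram", 0.90), ("dram", 0.99), ("hdd", 1.0)])
with_nvm    = avg_latency([("sram", 0.90), ("dram", 0.99), ("nvm", 0.99), ("hdd", 1.0)])
print(f"without NVM tier: {without_nvm:8.1f} (relative)")
print(f"with NVM tier:    {with_nvm:8.1f} (relative)")
# Misses that previously went to disk are mostly absorbed by the ~500x faster NVM tier.
```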
Scaling Challenges with DRAM and NAND
Source: Memory Technology Trend and Future Challenges, Sungjoo Hong, Hynix Semiconductor
Evaluate Emerging Memory Technologies
Memory Families                      Examples
Phase Change Memory                  Chalcogenide alloys, e.g. Ge2Sb2Te5
Magnetic Tunnel Junction (MTJ)       Spin Torque Transfer RAM
Electrochemical Cells                e.g. Cu/SiO2
Binary Oxide Filament Cells          e.g. NiO
Interfacial Switching ("Memristor")  e.g. Cr-doped SrTiO3 (STO), PCMO
1T1R cell [Diagram: selector and storage element between bitline and wordline]
• Selector = transistor or diode in bulk Si
• Storage element = resistor
• Cell size ~= DRAM
• Cost & density ~= DRAM
• Performance depends on memory

3D cross-point cell [Diagram: memory element and selector stacked at each wordline/bitline crossing]
• Selector = thin-film device
• Storage element = resistor
• Cell size ~= NAND
• Cost & density ~= NAND
• Performance depends on memory
Re-think DRAM Architectures
Today's DRAM [Diagram: RAS/CAS access activates a 1 kB page; 64 bytes are transferred]
• Activates one large page
• Lots of reads and writes (refresh)
• Only a small amount of the read data is used
• Power is wasted in maintaining/accessing the array

Revised DRAM [Diagram: RAS/CAS access activates a page of 256 bytes or smaller; 64 bytes are transferred]
• Activates one smaller page
• Fewer reads and writes (refresh)
• Most of the read data is used
• IO can be widened to increase BW
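A quick calculation of the over-fetch being eliminated, using the page and line sizes from the slide:

```python
# Fraction of the activated DRAM page actually used per 64-byte cache-line access.
# Page sizes are from the slide; smaller pages waste far less activation energy.

CACHE_LINE = 64  # bytes delivered to the core per access

for label, page_bytes in [("today's DRAM page", 1024), ("revised DRAM page", 256)]:
    used_fraction = CACHE_LINE / page_bytes
    print(f"{label:20s}: {page_bytes:5d} B activated, "
          f"{used_fraction:.1%} of it used per 64 B access")
# A 1 KB page uses only 6.2% of the activated bits; a 256 B page raises that to 25%,
# cutting the activate/precharge energy spent on data that is never read.
```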
Develop Innovative Packaging & IO Solutions
[Diagram: 3D-stacked packages: a CPU with 100s of cores stacked with DRAM or NVRAM die via through-silicon vias on an interconnect substrate, with heat extraction on top]
• Pins required and IO power limit the use of traditional packaging
• Tighter integration between memory and CPU
• High BW and low latency by exploiting memory locality
Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
“More cores per chip will slow some programs unless there’s a big boost in memory bandwidth.”
Source: “Multicore Is Bad News For Supercomputers”, IEEE Spectrum, November 2008
Software Support Must Evolve To Minimize Data Movement
• Locality optimizations
• SW-focused compression
• Create self-aware systems
Source: Exascale Software Study: Software Challenges in Extreme Scale Systems
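As one concrete example of a locality optimization (a generic cache-blocking sketch, not taken from the cited study), a blocked transpose keeps working tiles resident in cache rather than streaming the whole array repeatedly from memory:

```python
# Cache blocking (tiling): restructure a traversal so each block of data is reused
# while it is still resident in cache, minimizing trips to memory.
import numpy as np

def blocked_transpose(a, block=64):
    """Transpose a square matrix block by block so that both the source and
    destination tiles stay cache-resident (block size is an assumed tuning knob)."""
    n = a.shape[0]
    out = np.empty_like(a)
    for i in range(0, n, block):
        for j in range(0, n, block):
            out[j:j + block, i:i + block] = a[i:i + block, j:j + block].T
    return out

a = np.arange(512 * 512, dtype=np.float64).reshape(512, 512)
assert np.array_equal(blocked_transpose(a), a.T)
```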
In Conclusion
• Multi- and many-core systems are here to stay
• The memory system is the bottleneck
• Energy constrains platform performance, and data movement consumes much of that energy
• New memory technologies and software support must evolve to reduce data movement
• Scaling of new resistance-based memory technologies is becoming attractive
• New memory characteristics drive potential new memory hierarchies