Memory Architectures & Technologies for
Petascale Computing: Challenges and Opportunities
Fayé A. Briggs, PhD, Intel Fellow
Director of Scalable Server Architecture
Data Center Group
Intel Architecture Group
August 2011
Agenda
• Multi-Core Processor Performance Drivers
• Memory Bandwidth Trends & Issues
• Memory Technology Trends
• Summary
[Slide: Moore’s Law and system growth drive petascale workloads: life sciences/genomics, multi-physics CAD & manufacturing, climate modeling & weather prediction, research & analytics, energy & oil exploration, digital content creation, astrophysics]
Micro-architecture Performance Parameters
• Core capability (IPC)
• Core-count
• On-die Interconnect
• Frequency
• Cache Size
• Memory Latency
• Memory bandwidth & Size
• Memory Technologies
Many Knobs for Optimizing Performance within Power & Cost Constraints
Performance = Frequency × Instructions Per Cycle (IPC)
Power ∝ V² × Frequency
Frequency ∝ Voltage, so reducing frequency together with voltage yields a roughly cubic reduction in power.
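To make that scaling concrete, here is a minimal sketch (assuming dynamic power ∝ V²·f and f ∝ V; the frequency points are illustrative, not measured values):

```python
# Sketch: dynamic power scaling when voltage and frequency are reduced together.
# Assumes P ~ V^2 * f and f ~ V, so P ~ f^3 (illustrative numbers only).

def relative_power(freq_scale):
    """Relative dynamic power when frequency (and hence voltage) is scaled."""
    voltage_scale = freq_scale               # f is proportional to V
    return voltage_scale ** 2 * freq_scale   # P is proportional to V^2 * f

for f in (1.0, 0.9, 0.8, 0.5):
    print(f"frequency x{f:.1f} -> power x{relative_power(f):.3f}")
# frequency x0.8 -> power x0.512: a 20% frequency cut roughly halves dynamic power,
# which is why many slower cores can beat one fast core within a fixed power budget.
```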
Cores & Uncores: Performance Challenges
[Diagram: Nehalem architecture: cores plus uncore (last-level cache, memory controller, QPI links, power & clock), the on-die interconnect between cache, cores, and memory, and DRAM attached to the integrated memory controller]
[Diagram: platform examples: a two-socket Nehalem-EP system with Tylersburg-EP IOH connected by Intel® QPI, per-socket memory, PCI Express* Gen2, and ICH10; and a four-socket NHM-EX configuration]
3-level Inclusive Cache Hierarchy
• 1st level caches
– 32 kB Instruction Cache
– 32 kB Data Cache
– Support more cache misses in parallel
• 2nd level Unified Cache
– 256 kB per core
– Designed for very low latency
• 3rd level Unified Cache
– Size depends on # of cores
– Inclusive cache
– Core valid bits for minimizing snoop traffic
[Diagram: multiple cores, each with private L1 caches and a 256 kB L2 cache, sharing an inclusive L3 cache]
Why Inclusive?
• Cache acts as a snoop filter
• Only snoop cores when necessary
• Provides Scalability
• Minimizes Latency
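A toy sketch of this snoop-filtering behavior, using a hypothetical InclusiveLLC class that tracks core-valid bits per line (a simplification for illustration, not Intel's implementation):

```python
# Toy model of an inclusive LLC acting as a snoop filter with core-valid bits.
# Simplification: no coherence states, just per-line presence tracking.

class InclusiveLLC:
    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.lines = {}          # address -> set of core IDs that may hold the line

    def core_fill(self, core, addr):
        """A core brings a line into its private L1/L2; L3 records it (inclusion)."""
        self.lines.setdefault(addr, set()).add(core)

    def snoop_targets(self, requester, addr):
        """Cores that must be snooped for this request."""
        if addr not in self.lines:
            return []            # miss in the inclusive L3 => no core can hold it
        return [c for c in self.lines[addr] if c != requester]

llc = InclusiveLLC(num_cores=8)
llc.core_fill(core=2, addr=0x1000)
print(llc.snoop_targets(requester=0, addr=0x1000))  # [2]  -> snoop only core 2
print(llc.snoop_targets(requester=0, addr=0x2000))  # []   -> no snoop traffic at all
```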
Growing Gap In Single Thread Performance
[Chart: single-thread performance, today vs. today + 8 years, decomposed into speed (frequency: 100 MHz, 200 MHz, 450 MHz, 1.0 GHz, 3.2 GHz), process efficiency, and architectural efficiency, showing a widening gap to the historical trend line]
• Current trend assumes ~20% performance improvement tock to tock and ~30% (1.5:1) transistor growth per generation, which is getting very hard.
• To get back to the trend line: need ~62% performance improvement tock to tock.
Increasing Bandwidth to Bridge Multi-core Performance Gap
[Chart: peak memory bandwidth trend (GB/s, log scale) and processor performance (GFlops/s), 1997 to 2013, across the Pentium®, Pentium® II, Pentium® III, Pentium® 4, and Intel® Core™ microarchitectures and the SDR, DDR1, DDR2/DDR3, and DDR3/DDR4 memory generations. Source: Intel estimate of bandwidth trend.]
Even DDR4 bandwidth is not enough for Byte/Flop/sec balance.
Core Count Vs. Memory Bandwidth
Intel estimates of future trends in bandwidth capability, memory channels and core count. Intel estimates are based in part on historical Intel products and projections. Actual bandwidth capability, core count and number of channels of Intel products will vary based on actual product configurations.
[Chart: server core-to-memory ratio vs. core count (4 to 32 cores): if platforms continue with 3 DDR channels, peak bandwidth per core falls while the core-to-channel ratio climbs; a 4-channel assumption is shown for comparison]
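A back-of-envelope sketch of how per-core bandwidth and the byte/flop balance erode as core counts outgrow channel counts (the per-channel bandwidth and per-core flop rate below are assumed round numbers, not product figures):

```python
# Back-of-envelope: peak DRAM bandwidth per core and bytes/flop as core counts grow.
# Assumptions: ~10.7 GB/s per DDR3-1333 channel, 8 flops/cycle per core at 2.5 GHz.

CHANNEL_GBS = 10.7          # peak GB/s per channel (assumed)
FLOPS_PER_CORE = 8 * 2.5e9  # flops/s per core (assumed)

for cores in (4, 8, 16, 32):
    for channels in (3, 4):
        bw = channels * CHANNEL_GBS                       # GB/s per socket
        bw_per_core = bw / cores
        bytes_per_flop = bw * 1e9 / (cores * FLOPS_PER_CORE)
        print(f"{cores:2d} cores, {channels} ch: "
              f"{bw_per_core:4.1f} GB/s per core, {bytes_per_flop:.3f} B/flop")
# The bytes/flop ratio collapses as cores grow faster than channels,
# which is the balance problem the slide is pointing at.
```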
Technologies to Increase Bandwidth
[Chart: bandwidth projections, GB/s per socket, 2008 to 2017: traditional CPU and HE-WS/HPC bandwidth demand vs. the DDR3 trend, a DDR4 projection with increasing channel counts, and an eDRAM projection at 2X per 3 years CAGR. Source: Intel Forecast.]
• eDRAM: replace on-package memory controller with very fast flex links to an on-board memory controller [Diagram: CPU, memory controller + buffer, memory package]
Intel estimates of future trends in bandwidth capability. Intel estimates are based in part on historical bandwidth capability of Intel products and projections for bandwidth capability improvement. Actual bandwidth capability of Intel products will vary based on actual product configurations.
Memory Bandwidth Does Not Scale With Moore’s Law
• Significant increase in CPU computing power
• Memory BW not keeping up
• Gap must be closed to maintain system balance
Sources: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008); Intel Forecast
[Chart: bandwidth projections, GB/s per socket, 2008 to 2017: traditional CPU and HE-WS/HPC bandwidth demand vs. the DDR3 trend and a DDR4 assumption]
[Chart: relative growth from 2006 to 2018 in memory bandwidth, number of cores, and Flops/cycle/socket, with an exponential fit of the required memory bandwidth]
Workloads Matter
[Charts: workload performance vs. memory bandwidth (4, 12, and 20 channels) and vs. memory speed (1333 MHz, 2133 MHz, 3200 MHz), assuming an identical number of cores in all cases: some workloads show no change with increased bandwidth, others no change with increased latency]
A Number of Challenges to Deal With
Challenges: cost/area, power, extreme parallelism, Byte/Flop ratio.
[Chart: memory BW per socket (GB/s) and DRAM data pins per memory interface (@ 10 Gb/s) vs. TeraFlops per socket (4 to 40): keeping pin count low pushes power high, while keeping power low pushes pin count high]
Data-movement energy: ~10 pJ per byte, chip to chip ~100 pJ per byte, memory ~150 pJ per byte.
[Chart: top system concurrency trend, 1993 to 2009, rising from roughly 1E+02 to 1E+07]
[Chart: memory capacity (TB) vs. peak performance (TF) for some of the most recent Top 10 machines]
Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
A Complete Re-think of System Level Memory Architecture Is Needed
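A worked example of why those per-byte energies force the re-think, combining the slide's ~150 pJ/byte memory figure with an assumed 0.1 byte/flop balance at exascale (all numbers illustrative):

```python
# Illustrative energy cost of feeding an exascale-class machine from memory.
# Uses the slide's ~150 pJ/byte memory figure and an assumed 0.1 byte/flop balance.

PJ_PER_BYTE_MEMORY = 150e-12   # ~150 pJ per byte to/from memory (from the slide)
BYTES_PER_FLOP     = 0.1       # assumed target balance
PEAK_FLOPS         = 1e18      # 1 exaflop/s

bytes_per_sec = PEAK_FLOPS * BYTES_PER_FLOP            # 1e17 B/s
memory_power_watts = bytes_per_sec * PJ_PER_BYTE_MEMORY
print(f"Memory traffic: {bytes_per_sec / 1e15:.0f} PB/s")
print(f"Power for memory data movement alone: {memory_power_watts / 1e6:.0f} MW")
# ~15 MW just to move data to and from memory, before any computation.
```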
Consider New Levels of Memory Hierarchy
           On-CPU SRAM   DRAM      NVM Storage   Storage (HDD)
Latency    1x            ~20x      ~20,000x      ~10,000,000x
$$/bit     1x            ~0.05x    ~0.005x       ~0.0005x
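A small sketch of the effect of inserting an NVM tier between DRAM and disk, using the latency ratios above and purely illustrative hit rates:

```python
# Average memory-system latency with and without an NVM tier, using the table's
# latency ratios (SRAM = 1x) and assumed, purely illustrative hit rates.

LATENCY = {"sram": 1, "dram": 20, "nvm": 20_000, "hdd": 10_000_000}  # relative units

def avg_latency(tiers):
    """tiers: list of (name, hit_rate) checked in order; the last tier catches the rest."""
    total, remaining = 0.0, 1.0
    for name, hit in tiers[:-1]:
        total += remaining * hit * LATENCY[name]
        remaining *= (1 - hit)
    total += remaining * LATENCY[tiers[-1][0]]
    return total

without_nvm = avg_latency([("sram", 0.90), ("dram", 0.99), ("hdd", 1.0)])
with_nvm    = avg_latency([("sram", 0.90), ("dram", 0.99), ("nvm", 0.99), ("hdd", 1.0)])
print(f"without NVM tier: {without_nvm:8.1f} (relative)")
print(f"with NVM tier:    {with_nvm:8.1f} (relative)")
# Misses that previously went to disk are mostly absorbed by the ~500x faster NVM tier.
```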
Scaling Challenges with DRAM and NAND
Source: Memory Technology Trend and Future Challenges, Sungjoo Hong, Hynix Semiconductor
Evaluate Emerging Memory Technologies
Memory Families                      Examples
Phase Change Memory                  Chalcogenide alloys, e.g. Ge2Sb2Te5
Magnetic Tunnel Junction (MTJ)       Spin Torque Transfer RAM
Electrochemical Cells                e.g. Cu/SiO2
Binary Oxide Filament Cells          e.g. NiO
Interfacial Switching ("Memristor")  e.g. Cr-doped SrTiO3 (STO), PCMO
1T1R cell [Diagram: selector and storage element between bitline and wordline]
• Selector = transistor or diode in bulk Si
• Storage element = resistor
• Cell size ~= DRAM
• Cost & density ~= DRAM
• Performance depends on memory

3D cross-point cell [Diagram: memory element and selector stacked at each wordline/bitline crossing]
• Selector = thin-film device
• Storage element = resistor
• Cell size ~= NAND
• Cost & density ~= NAND
• Performance depends on memory
Re-think DRAM Architectures
Today's DRAM [Diagram: RAS/CAS access activates a 1 kB page; 64 bytes are transferred]
• Activates one large page
• Lots of reads and writes (refresh)
• Only a small amount of the read data is used
• Power is wasted in maintaining/accessing the array

Revised DRAM [Diagram: RAS/CAS access activates a page of 256 bytes or smaller; 64 bytes are transferred]
• Activates one smaller page
• Fewer reads and writes (refresh)
• Most of the read data is used
• IO can be widened to increase BW
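A quick calculation of the over-fetch being eliminated, using the page and line sizes from the slide:

```python
# Fraction of the activated DRAM page actually used per 64-byte cache-line access.
# Page sizes are from the slide; smaller pages waste far less activation energy.

CACHE_LINE = 64  # bytes delivered to the core per access

for label, page_bytes in [("today's DRAM page", 1024), ("revised DRAM page", 256)]:
    used_fraction = CACHE_LINE / page_bytes
    print(f"{label:20s}: {page_bytes:5d} B activated, "
          f"{used_fraction:.1%} of it used per 64 B access")
# A 1 KB page uses only 6.2% of the activated bits; a 256 B page raises that to 25%,
# cutting the activate/precharge energy spent on data that is never read.
```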
Develop Innovative Packaging & IO Solutions
[Diagram: 3D-stacked packages: a CPU with 100s of cores stacked with DRAM or NVRAM die via through-silicon vias on an interconnect substrate, with heat extraction on top]
• Pins required and IO power limit the use of traditional packaging
• Tighter integration between memory and CPU
• High BW and low latency by exploiting memory locality
Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
“More cores per chip will slow some programs unless there’s a big boost in memory bandwidth.”
Source: “Multicore Is Bad News For Supercomputers”, IEEE Spectrum, November 2008
Software Support Must Evolve To Minimize Data Movement
• Locality optimizations
• SW-focused compression
• Create self-aware systems
Source: Exascale Software Study: Software Challenges in Extreme Scale Systems
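As one concrete example of a locality optimization (a generic cache-blocking sketch, not taken from the cited study), a blocked transpose keeps working tiles resident in cache rather than streaming the whole array repeatedly from memory:

```python
# Cache blocking (tiling): restructure a traversal so each block of data is reused
# while it is still resident in cache, minimizing trips to memory.
import numpy as np

def blocked_transpose(a, block=64):
    """Transpose a square matrix block by block so that both the source and
    destination tiles stay cache-resident (block size is an assumed tuning knob)."""
    n = a.shape[0]
    out = np.empty_like(a)
    for i in range(0, n, block):
        for j in range(0, n, block):
            out[j:j + block, i:i + block] = a[i:i + block, j:j + block].T
    return out

a = np.arange(512 * 512, dtype=np.float64).reshape(512, 512)
assert np.array_equal(blocked_transpose(a), a.T)
```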
In Conclusion
• Multi- and many-core systems are here to stay
• The memory system is the bottleneck
• Energy constrains platform performance, and data movement consumes much of that energy
• New memory technologies and software support must evolve to reduce data movement
• Scaling of new resistance-based memory technologies is becoming attractive
• New memory characteristics drive potential new memory hierarchies