a case for subarray-level parallelism (salp) in dram yoongu kim, vivek seshadri, donghyuk lee, jamie...

48
A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Upload: bryce-strickland

Post on 24-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

A Case for Subarray-Level Parallelism

(SALP) in DRAM

Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Page 2: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

2

Executive Summary• Problem: Requests to same DRAM bank are

serialized• Our Goal: Parallelize requests to same DRAM

bank at a low cost• Observation: A bank consists of subarrays

that occassionally share global structures • Solution: Increase independence of

subarrays to enable parallel operation• Result: Significantly higher performance and

energy-efficiency at low cost (+0.15% area)

Page 3: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

3

Outline

•Motivation & Key Idea• Background•Mechanism• Related Works• Results

Page 4: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

4

Introduction

Bank

DRAM

Bank

Bank

Bank

Req

Req

Req

Req

Req Req Req Req

Bank conflict!

4x latency

Page 5: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

5

Three Problems1. Requests are serialized2. Serialization is worse after write requests3. Thrashing in row-buffer

Bank conflicts degrade performance

RowRow

RowRow

Bank

Row-Buffer

ReqReqReq

Thrashing: increases latency

Page 6: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

6

Case Study: Timeline

timeWr Rd

Wr Rdtime

Bank

time

Bank

Bank

• Case #1. Different Banks

• Case #2. Same Bank

1. Serialization

Wr Wr Rd RdWr 2 Wr 2 Rd RdWr 2 Wr 2 Rd Rd3 3 3

2. Write Penalty3. Thrashing Row-Buffer

Served in parallel Delayed

Page 7: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

7

Our Goal• Goal: Mitigate the detrimental effects of

bank conflicts in a cost-effective manner

• Naïve solution: Add more banks– Very expensive

• We propose a cost-effective solution

Page 8: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

8

A DRAM bank is divided into subarrays

Key Observation #1

Row

Row-Buffer

RowRowRow

32k rows

Logical Bank

A single row-buffer cannot drive all rows

Global Row-Buf

Physical Bank

Local Row-Buf

Local Row-BufSubarray1

Subarray64

Many local row-buffers, one at each subarray

Page 9: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

9

Key Observation #2Each subarray is mostly independent… – except occasionally sharing global structures

Global Row-Buf

Glo

bal D

ecod

er

Bank

Local Row-Buf

Local Row-BufSubarray1

Subarray64

···

Page 10: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

10

Key Idea: Reduce Sharing of Globals

Global Row-Buf

Glo

bal D

ecod

er

Bank

Local Row-Buf

Local Row-Buf

···

1. Parallel access to subarrays

2. Utilize multiple local row-buffers

Page 11: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

11

Overview of Our Mechanism

··· ReqReq

Global Row-Buf

Local Row-Buf

Req

Local Row-Buf

Req1. Parallelize2. Utilize multiple

local row-buffers

Subarray64

Subarray1

To same bank...but diff. subarrays

Page 12: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

12

Outline

•Motivation & Key Idea• Background•Mechanism• Related Works• Results

Page 13: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

13

DRAM System

Organization of DRAM System

Bank

Rank

Bank

RankChannel

Bus CPU

Page 14: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

14

1. More channels: expensive2. More ranks: low performance3. More banks: expensive

Naïve Solutions to Bank Conflicts

DRAM SystemChannel

Channel

Channel

Channel

Bus

Bus

Bus

Bus

Many CPU pins

Channel

R RR RLow frequency

ChannelRank

Bank

Significantly increases DRAM die area

Large load

Page 15: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

15

data

Logical Bank

RowRowRowRow

wordlines

bitlines

PrechargedState

ActivatedState

000

ACTIVATE

PRECHARGE

addrD

ecod

er VDD

?

Row-Buffer RD/WR

0

Total latency: 50ns!

Page 16: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

16

Physical Bank

Row-Buffer

32k

row

s

very long bitlines:hard to drive

Global Row-Buf

Local Row-Buf

Local Row-BufSubarray1

···

Local bitlines:short

512

row

s

Subarray64

Page 17: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

17

Hynix 4Gb DDR3 (23nm) Lim et al., ISSCC’12Ba

nk0

Bank

1

Bank

2

Bank

3 Subarray SubarrayDecoder

Tile

Magnified

Bank

5

Bank

6

Bank

7

Bank

8

Page 18: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

18

Bank: Full Picture

Global Row-Buf

Local Row-Buf

Local Row-Buf

···

Local bitlines

Subarray64

Subarray1

Local bitlines

Global bitlinesBankG

loba

l Dec

oder

SubarrayDecoderLa

tch

Page 19: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

19

Outline

•Motivation & Key Idea• Background•Mechanism• Related Works• Results

Page 20: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

20

Problem Statement

··· ReqReq

Global Row-Buf

Local Row-Buf

Local Row-Buf

Serialized!

To different subarrays

Page 21: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

21

MASA (Multitude of Activated Subarrays)Overview: MASA

···addr

VDD

addrG

loba

l Dec

oder

VDD

Local Row-Buf

Local Row-BufACTIVATED

Global Row-BufACTIVATED

READREAD

Challenges: Global Structures

Page 22: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

22

Challenges: Global Structures1. Global Address Latch

2. Global Bitlines

Page 23: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

23

Localrow-buffer

Localrow-bufferGlobalrow-buffer

Challenge #1. Global Address Latch

···addr

VDD

addr

Glo

bal D

ecod

er

VDD

Latc

hLa

tch

Latc

h PRECHARGED

ACTIVATED

ACTIVATED

Page 24: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

24

Localrow-buffer

Localrow-bufferGlobalrow-buffer

Solution #1. Subarray Address Latch

···

VDD

Glo

bal D

ecod

er

VDD

Latc

hLa

tch

Latc

h ACTIVATED

ACTIVATED

Page 25: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

25

Challenges: Global Structures1. Global Address Latch• Problem: Only one raised wordline• Solution: Subarray Address Latch

2. Global Bitlines

Page 26: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

26

Challenge #2. Global Bitlines

Localrow-buffer

Local row-buffer

Switch

Switch

READ

Global bitlines

Global row-buffer

Collision

Page 27: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

27

Wire

Solution #2. Designated-Bit LatchGlobal bitlines

Global row-buffer

Localrow-buffer

Local row-buffer

Switch

Switch

READREAD

DD

DD

Page 28: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

28

Challenges: Global Structures1. Global Address Latch• Problem: Only one raised wordline• Solution: Subarray Address Latch

2. Global Bitlines• Problem: Collision during access• Solution: Designated-Bit Latch

Page 29: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

29

• Baseline (Subarray-Oblivious)

• MASA

MASA: Advantages

timeWr 2 Wr 2 Rd Rd3 3 3

1. Serialization

2. Write Penalty 3. Thrashing

timeWr

Wr

Rd

Rd

Saved

Page 30: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

30

MASA: Overhead• DRAM Die Size: Only 0.15% increase– Subarray Address Latches– Designated-Bit Latches & Wire

• DRAM Static Energy: Small increase– 0.56mW for each activated subarray– But saves dynamic energy

• Controller: Small additional storage– Keep track of subarray status (< 256B)– Keep track of new timing constraints

Page 31: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

31

Cheaper Mechanisms

D

D

Latches

1. S

eria

lizati

on

2. W

r-Pe

nalty

3. T

hras

hing

MASA

SALP-2

SALP-1

Page 32: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

32

Outline

•Motivation & Key Idea• Background•Mechanism• Related Works• Results

Page 33: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

33

Related Works• Randomized bank index [Rau ISCA’91, Zhang+ MICRO’00, …]– Use XOR hashing to generate bank index– Cannot parallelize bank conflicts

• Rank-subsetting [Ware+ ICCD’06, Zheng+ MICRO’08, Ahn+ CAL’09, …]– Partition rank and data-bus into multiple subsets– Increases unloaded DRAM latency

• Cached DRAM [Hidaka+ IEEE Micro’90, Hsu+ ISCA’93, …]– Add SRAM cache inside of DRAM chip– Increases DRAM die size (+38.8% for 64kB)

• Hierarchical Bank [Yamauchi+ ARVLSI’97]– Parallelize accesses to subarrays– Adds complex logic to subarrays– Does not utilize multiple local row-buffers

Page 34: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

34

Outline

•Motivation & Key Idea• Background•Mechanism• Related Works• Results

Page 35: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

35

Methodology• DRAM Area/Power– Micron DDR3 SDRAM System-Power Calculator– DRAM Area/Power Model [Vogelsang, MICRO’10]– CACTI-D [Thoziyoor+, ISCA’08]

• Simulator– CPU: Pin-based, in-house x86 simulator– Memory: Validated cycle-accurate DDR3 DRAM simulator

• Workloads– 32 Single-core benchmarks• SPEC CPU2006, TPC, STREAM, random-access• Representative 100 million instructions

– 16 Multi-core workloads• Random mix of single-thread benchmarks

Page 36: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

36

Configuration• System Configuration– CPU: 5.3GHz, 128 ROB, 8 MSHR– LLC: 512kB per-core slice

• Memory Configuration– DDR3-1066– (default) 1 channel, 1 rank, 8 banks, 8 subarrays-per-bank– (sensitivity) 1-8 chans, 1-8 ranks, 8-64 banks, 1-128 subarrays

• Mapping & Row-Policy– (default) Line-interleaved & Closed-row– (sensitivity) Row-interleaved & Open-row

• DRAM Controller Configuration– 64-/64-entry read/write queues per-channel– FR-FCFS, batch scheduling for writes

Page 37: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

37

Single-Core: Instruction Throughput

hmm

erle

slie3

dze

usm

p

Gem

s.sp

hinx

3

scal

e

add

tria

d

gmea

n

0%10%20%30%40%50%60%70%80% MASA "Ideal"

IPC

Impr

ovem

ent

17%

20%

MASA achieves most of the benefit of having more banks (“Ideal”)

Page 38: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

38

Single-Core: Instruction Throughput

0%

10%

20%

30%

SALP-1 SALP-2MASA "Ideal"

IPC

Incr

ease

SALP-1, SALP-2, MASA improve performance at low cost

20%17%13%7%

DRAM Die Area

< 0.15% 0.15% 36.3%

Page 39: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

39

Single-Core: Sensitivity to Subarrays

1 2 4 8 16 32 64 1280%5%

10%15%20%25%30% MASA

Subarrays-per-bank

IPC

Impr

ovem

ent

You do not need many subarrays for high performance

Page 40: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

40

Single-Core: Row-Interleaved, Open-Row

0%

5%

10%

15%

20%

MASA "Ideal"IP

C In

crea

se

15%12%

MASA’s performance benefit is robust to mapping and page-policy

Page 41: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

41

Single-Core: Row-Interleaved, Open-Row

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Baseline MASA

Nor

mal

ized

D

ynam

ic E

nerg

y

0%

20%

40%

60%

80%

100%

Baseline MASA

Row

-Buff

er H

it-Ra

te

MASA increases energy-efficiency

-19%

+13%

Page 42: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

42

Other Results/Discussion in Paper• Multi-core results

• Sensitivity to number of channels & ranks

• DRAM die area overhead of:–Naively adding more banks–Naively adding SRAM caches

• Survey of alternative DRAM organizations–Qualitative comparison

Page 43: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

43

Conclusion• Problem: Requests to same DRAM bank are

serialized• Our Goal: Parallelize requests to same DRAM

bank at a low cost• Observation: A bank consists of subarrays

that occassionally share global structures • MASA: Reduces sharing to enable parallel

access and to utilize multiple row-buffers• Result: Significantly higher performance and

energy-efficiency at low cost (+0.15% area)

Page 44: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

A Case for Subarray-Level Parallelism

(SALP) in DRAMYoongu Kim, Vivek Seshadri,

Donghyuk Lee, Jamie Liu, Onur Mutlu

Page 45: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

45

Exposing Subarrays to Controller• Every DIMM has an SPD (Serial Presence Detect)

– 256-byte EEPROM– Contains information about DIMM and DRAM devices– Read by BIOS during system-boot

• SPD reserves 100+ bytes for manufacturer and user– Sufficient for subarray-related information

1. Whether SALP-1, SALP-2, MASA are supported2. Physical address bit positions for subarray index3. Values of timing constraints: tRA, tWA

(Image: JEDEC)

Page 46: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

46

Multi-Core: Memory SchedulingConfiguration: 8-16 cores, 2 chan, 2 ranks-per-chan

FRFCFS TCM FRFCFS TCM8-core system 16-core system

0%

5%

10%

15%

20%

25%Baseline SALP-1 SALP-2 MASA

WS

Incr

ease

Our mechanisms further improve performance when employed with application-aware schedulers

We believe it can be even greater with subarray-aware schedulers

Page 47: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

47

Number of Subarrays-Per-Bank• As DRAM chips grow in capacity…– More rows-per-bank More subarrays-per-bank

• Not all subarrays may be accessed in parallel– Faulty rows remapped to spare rows– If remapping occurs between two subarrays…• They can no longer be accessed in parallel

• Subarray group– Restrict remapping: only within a group of subarrays– Each subarray group can accessed in parallel– We refer to a subarray group as a “subarray”• We assume 8 subarrays-per-bank

Page 48: A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

48

Area & Power Overhead• Latches: Per-Subarray Row-Address, Designated-Bit

– Storage: 41 bits per subarray– Area: 0.15% in die area (assuming 8 subarrays-per-bank)– Power: 72.2uW (negligible)

• Multiple Activated Subarrays– Power: 0.56mW static power for each additional activated subarray

• Small compared to 48mW baseline static power

• SA-SEL Wire/Command– Area: One extra wire (negligible)– Power: SA-SEL consumes 49.6% the power of ACT

• Memory Controller: Tracking the status of subarrays– Storage: Less than 256 bytes

• Activated? Which wordline is raised? Designated?