
A Case for Subarray-Level Parallelism

(SALP) in DRAM

Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Executive Summary
• Problem: Requests to the same DRAM bank are serialized
• Our Goal: Parallelize requests to the same DRAM bank at a low cost
• Observation: A bank consists of subarrays that only occasionally share global structures
• Solution: Increase the independence of subarrays to enable parallel operation
• Result: Significantly higher performance and energy-efficiency at low cost (+0.15% area)

Outline
• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Introduction
[Figure: four requests queued to the same bank of a four-bank DRAM. Bank conflict! The requests are serialized, incurring 4x latency]

Three Problems
1. Requests are serialized
2. Serialization is worse after write requests
3. Thrashing in the row-buffer

Bank conflicts degrade performance
[Figure: requests to different rows of one bank thrash its row-buffer; thrashing increases latency]

Case Study: Timeline
• Case #1. Different Banks: the write and read requests are served in parallel
• Case #2. Same Bank: requests are delayed by 1. serialization, 2. the write penalty, and 3. row-buffer thrashing
[Figure: timelines contrasting the two cases; Wr/Rd requests to different banks overlap, while those to the same bank are serialized]
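The bank-conflict cost in the timelines above can be sketched with a toy timing model. The latency value is an assumption chosen to match the slides' 50ns figure; this is not the paper's simulator:

```python
# Illustrative timing model: total service time for requests that either
# hit different banks (fully overlapped) or conflict on one bank (serialized).
def service_time(num_reqs, same_bank, t_bank=50):
    """t_bank: assumed full ACTIVATE + RD/WR + PRECHARGE latency in ns."""
    if same_bank:
        return num_reqs * t_bank   # bank conflict: requests are serialized
    return t_bank                  # different banks: requests overlap

print(service_time(4, same_bank=True))   # 200 ns, the "4x latency" case
print(service_time(4, same_bank=False))  # 50 ns
```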

Our Goal
• Goal: Mitigate the detrimental effects of bank conflicts in a cost-effective manner
• Naïve solution: Add more banks
  – Very expensive
• We propose a cost-effective solution

Key Observation #1
A DRAM bank is divided into subarrays
• Logical bank: 32k rows share one row-buffer, but a single row-buffer cannot drive all rows
• Physical bank: Subarray1 … Subarray64, with many local row-buffers, one at each subarray, plus a global row-buffer
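The division above implies a simple address split. A minimal sketch using the slides' figures (32k rows per bank, 64 subarrays); the exact bit-level mapping is an assumption for illustration:

```python
# Splitting a bank row address into (subarray index, local row).
ROWS_PER_BANK = 32 * 1024
SUBARRAYS_PER_BANK = 64
ROWS_PER_SUBARRAY = ROWS_PER_BANK // SUBARRAYS_PER_BANK  # 512

def split_row_addr(row):
    # assumed mapping: high-order bits select the subarray
    return divmod(row, ROWS_PER_SUBARRAY)

print(split_row_addr(513))   # (1, 1): row 513 is local row 1 of subarray 1
```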

Key Observation #2
Each subarray is mostly independent…
– except occasionally sharing global structures
[Figure: Subarray1 … Subarray64, each with a local row-buffer, sharing the bank’s global decoder and global row-buffer]

Key Idea: Reduce Sharing of Globals
1. Parallel access to subarrays
2. Utilize multiple local row-buffers
[Figure: the bank’s global decoder and global row-buffer, with reduced sharing across the local row-buffers]

Overview of Our Mechanism
• Requests to the same bank… but different subarrays (Subarray1 … Subarray64)
1. Parallelize them
2. Utilize multiple local row-buffers
[Figure: two requests served from different subarrays’ local row-buffers through the global row-buffer]

Outline
• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Organization of DRAM System
[Figure: the CPU connects over a bus to each channel; a channel contains ranks, and each rank contains banks]

Naïve Solutions to Bank Conflicts
1. More channels: expensive (many CPU pins)
2. More ranks: low performance (the large load on the shared bus lowers its frequency)
3. More banks: expensive (significantly increases DRAM die area)

Logical Bank
[Figure: a decoder drives wordlines; bitlines connect the rows to the row-buffer. ACTIVATE raises a wordline and moves that row’s data into the row-buffer (precharged state → activated state); RD/WR accesses the row-buffer; PRECHARGE prepares the bitlines for the next ACTIVATE]
Total latency: 50ns!
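The command sequence above can be captured as a tiny state machine. A sketch for illustration only; latencies are omitted (the slides put the full ACTIVATE/RD/PRECHARGE sequence at roughly 50ns), and the class name is not the paper's:

```python
# Minimal model of a logical bank's command protocol: ACTIVATE moves a row
# into the row-buffer, RD accesses it, PRECHARGE returns to the precharged state.
class LogicalBank:
    def __init__(self):
        self.active_row = None              # None means precharged

    def activate(self, row):
        assert self.active_row is None, "PRECHARGE before a new ACTIVATE"
        self.active_row = row               # row's data now in the row-buffer

    def read(self, row):
        assert self.active_row == row, "row must be in the row-buffer"
        return f"data@{row}"

    def precharge(self):
        self.active_row = None              # bitlines ready for next ACTIVATE

b = LogicalBank()
b.activate(7)
print(b.read(7))   # data@7
b.precharge()
```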

Physical Bank
• A logical bank would need 32k rows on very long bitlines: hard to drive
• Physical bank: Subarray1 … Subarray64, each with 512 rows, short local bitlines, and a local row-buffer, connected to a global row-buffer

Hynix 4Gb DDR3 (23nm) [Lim et al., ISSCC’12]
[Die photo: eight banks; the magnified view shows a tile containing a subarray and its subarray decoder]

Bank: Full Picture
[Figure: Subarray1 … Subarray64, each with local bitlines and a local row-buffer; a global decoder with an address latch drives the subarray decoders, and global bitlines connect the local row-buffers to the global row-buffer]

Outline
• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Problem Statement
• Requests to different subarrays of the same bank are still serialized!
[Figure: two requests to different subarrays, both waiting on the shared global row-buffer]

Overview: MASA (Multitude of Activated Subarrays)
• Multiple subarrays are ACTIVATED at once, each holding a row in its local row-buffer
• READs are then served from the activated local row-buffers via the global row-buffer
• Challenges: Global Structures

Challenges: Global Structures
1. Global Address Latch
2. Global Bitlines

Challenge #1. Global Address Latch
[Figure: the global decoder drives every subarray through a single shared address latch, so while one subarray is ACTIVATED the others remain PRECHARGED; only one wordline can be raised at a time]

Solution #1. Subarray Address Latch
[Figure: each subarray is given its own address latch, so several subarrays can hold raised wordlines and remain ACTIVATED simultaneously]

Challenges: Global Structures
1. Global Address Latch
 • Problem: Only one raised wordline
 • Solution: Subarray Address Latch
2. Global Bitlines

Challenge #2. Global Bitlines
[Figure: two local row-buffers connect to the shared global bitlines through their switches during a READ, causing a collision at the global row-buffer]

Solution #2. Designated-Bit Latch
[Figure: a designated-bit (D) latch at each subarray, set through a shared wire, controls the subarray’s switch so that only the designated local row-buffer drives the global bitlines during a READ]

Challenges: Global Structures
1. Global Address Latch
 • Problem: Only one raised wordline
 • Solution: Subarray Address Latch
2. Global Bitlines
 • Problem: Collision during access
 • Solution: Designated-Bit Latch

MASA: Advantages
• Baseline (Subarray-Oblivious): suffers 1. serialization, 2. the write penalty, and 3. row-buffer thrashing
• MASA: writes and reads to different subarrays are overlapped, saving time
[Figure: baseline vs. MASA timelines]

MASA: Overhead
• DRAM Die Size: Only 0.15% increase
 – Subarray Address Latches
 – Designated-Bit Latches & Wire
• DRAM Static Energy: Small increase
 – 0.56mW for each activated subarray
 – But saves dynamic energy
• Controller: Small additional storage
 – Keep track of subarray status (< 256B)
 – Keep track of new timing constraints

Cheaper Mechanisms
• SALP-1: mitigates 1. Serialization
• SALP-2: adds the latches; also mitigates 2. Wr-Penalty
• MASA: adds the designated-bit (D) latches; also mitigates 3. Thrashing

Outline
• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Related Works
• Randomized bank index [Rau ISCA’91, Zhang+ MICRO’00, …]
 – Use XOR hashing to generate bank index
 – Cannot parallelize bank conflicts
• Rank-subsetting [Ware+ ICCD’06, Zheng+ MICRO’08, Ahn+ CAL’09, …]
 – Partition rank and data-bus into multiple subsets
 – Increases unloaded DRAM latency
• Cached DRAM [Hidaka+ IEEE Micro’90, Hsu+ ISCA’93, …]
 – Add SRAM cache inside of DRAM chip
 – Increases DRAM die size (+38.8% for 64kB)
• Hierarchical Bank [Yamauchi+ ARVLSI’97]
 – Parallelize accesses to subarrays
 – Adds complex logic to subarrays
 – Does not utilize multiple local row-buffers

Outline
• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Methodology
• DRAM Area/Power
 – Micron DDR3 SDRAM System-Power Calculator
 – DRAM Area/Power Model [Vogelsang, MICRO’10]
 – CACTI-D [Thoziyoor+, ISCA’08]
• Simulator
 – CPU: Pin-based, in-house x86 simulator
 – Memory: Validated cycle-accurate DDR3 DRAM simulator
• Workloads
 – 32 single-core benchmarks
  • SPEC CPU2006, TPC, STREAM, random-access
  • Representative 100 million instructions
 – 16 multi-core workloads
  • Random mix of single-thread benchmarks

Configuration
• System Configuration
 – CPU: 5.3GHz, 128 ROB, 8 MSHR
 – LLC: 512kB per-core slice
• Memory Configuration
 – DDR3-1066
 – (default) 1 channel, 1 rank, 8 banks, 8 subarrays-per-bank
 – (sensitivity) 1-8 chans, 1-8 ranks, 8-64 banks, 1-128 subarrays
• Mapping & Row-Policy
 – (default) Line-interleaved & Closed-row
 – (sensitivity) Row-interleaved & Open-row
• DRAM Controller Configuration
 – 64-/64-entry read/write queues per-channel
 – FR-FCFS, batch scheduling for writes

Single-Core: Instruction Throughput
[Chart: IPC improvement (0%–80%) across benchmarks (hmmer, leslie3d, zeusmp, GemsFDTD, sphinx3, scale, add, triad, …); gmean: MASA 17%, “Ideal” 20%]
MASA achieves most of the benefit of having more banks (“Ideal”)

Single-Core: Instruction Throughput
[Chart: IPC increase of SALP-1 (7%), SALP-2 (13%), MASA (17%), and “Ideal” (20%); DRAM die area cost: < 0.15% for SALP-1/SALP-2, 0.15% for MASA, 36.3% for “Ideal”]
SALP-1, SALP-2, MASA improve performance at low cost

Single-Core: Sensitivity to Subarrays
[Chart: MASA’s IPC improvement (0%–30%) vs. subarrays-per-bank (1, 2, 4, 8, 16, 32, 64, 128)]
You do not need many subarrays for high performance

Single-Core: Row-Interleaved, Open-Row
[Chart: IPC increase; MASA 12%, “Ideal” 15%]
MASA’s performance benefit is robust to mapping and page-policy

Single-Core: Row-Interleaved, Open-Row
[Charts: normalized DRAM dynamic energy (MASA: −19% vs. Baseline) and row-buffer hit-rate (MASA: +13% vs. Baseline)]
MASA increases energy-efficiency

Other Results/Discussion in Paper
• Multi-core results
• Sensitivity to number of channels & ranks
• DRAM die area overhead of:
 – Naively adding more banks
 – Naively adding SRAM caches
• Survey of alternative DRAM organizations
 – Qualitative comparison

Conclusion
• Problem: Requests to the same DRAM bank are serialized
• Our Goal: Parallelize requests to the same DRAM bank at a low cost
• Observation: A bank consists of subarrays that only occasionally share global structures
• MASA: Reduces sharing to enable parallel access and to utilize multiple row-buffers
• Result: Significantly higher performance and energy-efficiency at low cost (+0.15% area)

A Case for Subarray-Level Parallelism (SALP) in DRAM
Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Exposing Subarrays to Controller
• Every DIMM has an SPD (Serial Presence Detect)
 – 256-byte EEPROM
 – Contains information about DIMM and DRAM devices
 – Read by BIOS during system-boot
• SPD reserves 100+ bytes for manufacturer and user
 – Sufficient for subarray-related information:
 1. Whether SALP-1, SALP-2, MASA are supported
 2. Physical address bit positions for subarray index
 3. Values of timing constraints: tRA, tWA
(Image: JEDEC)
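The three items above could be carried as a small record in that reserved SPD space. A hypothetical sketch: every field name, bit position, and timing value below is invented for illustration (real SPD layouts are defined by JEDEC):

```python
# Hypothetical subarray-related record for the SPD's manufacturer/user area.
salp_spd_record = {
    "modes_supported": {"SALP-1": True, "SALP-2": True, "MASA": True},
    "subarray_index_bits": [13, 14, 15],  # assumed physical address bit positions
    "timing_ns": {"tRA": 4, "tWA": 4},    # assumed values for tRA, tWA
}

def controller_can_use(record, mode):
    # the memory controller enables a mechanism only if the DIMM reports it
    return record["modes_supported"].get(mode, False)

print(controller_can_use(salp_spd_record, "MASA"))   # True
```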

Multi-Core: Memory Scheduling
Configuration: 8-16 cores, 2 chan, 2 ranks-per-chan
[Chart: weighted-speedup (WS) increase (0%–25%) of Baseline, SALP-1, SALP-2, and MASA under FRFCFS and TCM scheduling, for 8-core and 16-core systems]
Our mechanisms further improve performance when employed with application-aware schedulers
We believe it can be even greater with subarray-aware schedulers

Number of Subarrays-Per-Bank
• As DRAM chips grow in capacity…
 – More rows-per-bank → more subarrays-per-bank
• Not all subarrays may be accessed in parallel
 – Faulty rows are remapped to spare rows
 – If remapping occurs between two subarrays…
  • They can no longer be accessed in parallel
• Subarray group
 – Restrict remapping: only within a group of subarrays
 – Each subarray group can be accessed in parallel
 – We refer to a subarray group as a “subarray”
  • We assume 8 subarrays-per-bank
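The group restriction above can be sketched as a guarantee check. The 512 rows per physical subarray follows the earlier slides; the group size of 8 is an assumed example, and the function names are invented:

```python
# Spare-row remapping stays within a subarray group, so rows in different
# groups can always be accessed in parallel.
ROWS_PER_SUBARRAY = 512
SUBARRAYS_PER_GROUP = 8
ROWS_PER_GROUP = ROWS_PER_SUBARRAY * SUBARRAYS_PER_GROUP  # 4096

def group_of(row):
    return row // ROWS_PER_GROUP

def parallel_safe(row_a, row_b):
    # different groups: no remapping can link the two rows' subarrays
    return group_of(row_a) != group_of(row_b)

print(parallel_safe(0, 4096))   # True: different groups
print(parallel_safe(0, 4095))   # False: possibly remapped within one group
```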

Area & Power Overhead
• Latches: Per-Subarray Row-Address, Designated-Bit
 – Storage: 41 bits per subarray
 – Area: 0.15% of die area (assuming 8 subarrays-per-bank)
 – Power: 72.2uW (negligible)
• Multiple Activated Subarrays
 – Power: 0.56mW static power for each additional activated subarray
  • Small compared to 48mW baseline static power
• SA-SEL Wire/Command
 – Area: One extra wire (negligible)
 – Power: SA-SEL consumes 49.6% of the power of ACT
• Memory Controller: Tracking the status of subarrays
 – Storage: Less than 256 bytes
  • Activated? Which wordline is raised? Designated?
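The static-power claim can be sanity-checked with the slide's own numbers. A rough back-of-envelope, assuming all 8 subarrays of a bank are kept activated (7 beyond the baseline's one):

```python
# 0.56 mW per additional activated subarray vs. a 48 mW baseline.
BASELINE_MW = 48.0
PER_EXTRA_SUBARRAY_MW = 0.56
extra_mw = 7 * PER_EXTRA_SUBARRAY_MW        # 7 extra activated subarrays
overhead_pct = 100 * extra_mw / BASELINE_MW
print(f"{extra_mw:.2f} mW extra, {overhead_pct:.1f}% of baseline")
```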
