Lecture 14: DRAM and Prefetching
SRAM vs. DRAM
• DRAM = Dynamic RAM
• SRAM: 6T per bit
  – built with normal high-speed CMOS technology
• DRAM: 1T per bit
  – built with a special DRAM process optimized for density
Implementing the Capacitor
• You can use a “dead” transistor gate:
  – but this wastes area because we now have two transistors
  – and the “dummy” transistor may need to be bigger to hold enough charge
Implementing the Capacitor (2)
• There are other advanced structures

[Figure: “Trench Cell” cross-section — cell plate Si, cap insulator, storage node poly, field oxide, refilling poly, Si substrate]

DRAM figures on this slide and the previous one were taken from Prof. Nikolic’s EECS141/2003 lecture notes from UC-Berkeley
DRAM Chip Organization

[Figure: memory cell array with a row decoder (driven by the row address), sense amps feeding a row buffer, and a column decoder (driven by the column address) selecting data onto the data bus]
DRAM Chip Organization (2)
• High-level organization is very similar to SRAM
  – cells are only single-ended
    • changes the precharging and sensing circuits
    • makes reads destructive: contents are erased after reading
  – row buffer
    • read lots of bits all at once, then parcel them out based on different column addresses
    • similar to reading a full cache line, but only accessing one word at a time
• “Fast Page Mode” (FPM) DRAM organizes the DRAM row to contain the bits for a complete page
  – the row address is held constant, and then different locations in the same page can be read quickly
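A minimal sketch of how an address splits into row and column components (the 4096×1024 geometry and the function names are assumptions for illustration, not any real part’s layout):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: 4096 rows x 1024 columns per array (made-up
     * numbers). The row bits select the wordline, and the whole row lands
     * in the row buffer; the column bits then pick data out of that
     * buffer without touching the array again. */
    #define COL_BITS 10
    #define ROW_BITS 12

    static unsigned row_of(uint32_t addr) { return (addr >> COL_BITS) & ((1u << ROW_BITS) - 1); }
    static unsigned col_of(uint32_t addr) { return addr & ((1u << COL_BITS) - 1); }

    int main(void) {
        uint32_t a = 0xABCDE;
        printf("addr 0x%x -> row %u, col %u\n", (unsigned)a, row_of(a), col_of(a));
        /* FPM: consecutive addresses that share row_of(addr) hit the open
         * row buffer -- only the column address changes. */
        return 0;
    }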
Destructive Read

[Figure: voltage waveforms for reading a stored 1 and a stored 0 — once the wordline is enabled, charge sharing nudges the precharged bitline voltage away from Vdd; the sense amp is then enabled to drive a full-swing value]

After a read of 0 or 1, the cell contains something close to ½
Refresh
• So after a read, the contents of the DRAM cell are gone
• The values are stored in the row buffer
• Write them back into the cells for the next read in the future

[Figure: sense amps/row buffer writing the values back into the DRAM cells]
Refresh (2)
• Fairly gradually, the DRAM cell will lose its contents even if it’s not accessed
  – this is why it’s called “dynamic”
  – contrast to SRAM, which is “static” in that once written, it maintains its value forever (so long as power remains on)
• All DRAM rows need to be regularly read and re-written

[Figure: gate leakage gradually draining a stored 1 toward 0]
DRAM Read Timing

[Figure: asynchronous DRAM read timing — accesses are triggered by the RAS and CAS signals, which can in theory occur at arbitrary times (subject to DRAM timing constraints)]
SDRAM Read Timing

[Figure: synchronous DRAM read timing showing the burst length; Double Data Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock, but the command frequency does not change]

Timing figures taken from “A Performance Comparison of Contemporary DRAM Architectures” by Cuppu, Jacob, Davis and Mudge
More Latency

[Figure: significant wire delay just getting from the CPU to the memory controller, then more wire delay getting to the memory chips (plus the return trip); bus width and speed vary depending on memory type]
Memory Controller

[Figure: memory controller between the CPU and the DRAM banks — a read queue, write queue, and response queue face the CPU, while a scheduler buffer issues commands and data to Bank 0 and Bank 1]

Like a write-combining buffer, the scheduler may coalesce multiple accesses together, or re-order them to reduce the number of row accesses
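A toy sketch of the scheduler’s re-ordering idea — prefer requests that hit the currently open row, in the spirit of FR-FCFS scheduling. The queue depth, row size, and names are made up; no real controller is this simple:

    #include <stdint.h>
    #include <stdbool.h>

    #define ROW_SHIFT 12   /* assume 4 KB rows -- made-up geometry */
    #define QDEPTH    16

    typedef struct { uint64_t addr; bool valid; } req_t;

    static req_t    queue[QDEPTH];           /* pending requests, oldest first  */
    static uint64_t open_row = UINT64_MAX;   /* row currently in the row buffer */

    /* Prefer the oldest request that hits the open row (no precharge or
     * activate needed); otherwise fall back to the oldest request
     * overall. Returns a queue index, or -1 if the queue is empty. */
    static int pick_next(void) {
        int oldest = -1;
        for (int i = 0; i < QDEPTH; i++) {
            if (!queue[i].valid) continue;
            if ((queue[i].addr >> ROW_SHIFT) == open_row)
                return i;                    /* row-buffer hit: issue now */
            if (oldest < 0) oldest = i;
        }
        return oldest;
    }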
Wire-Dominated Latency
• CPUs
  – frequency has increased at about 60% per year
• DRAM
  – end-to-end latency has decreased only about 10% per year
• The number of cycles for a memory access keeps increasing
  – a.k.a. the “memory wall”
  – note: the absolute latency of memory is decreasing
    • just not nearly as fast as the CPU
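Back-of-the-envelope math for why the gap compounds, using the growth rates above (the 10-year horizon is just an illustration, not from the lecture):

    #include <stdio.h>

    int main(void) {
        /* Cycles per memory access = latency (seconds) x clock frequency.
         * Frequency grows ~60%/yr while latency shrinks ~10%/yr, so the
         * cycle count grows by about 1.6 * 0.9 = 1.44x per year. */
        double cycles = 1.0;                 /* normalized starting point */
        for (int yr = 1; yr <= 10; yr++)
            cycles *= 1.6 * 0.9;
        printf("after 10 years: %.0fx more cycles per access\n", cycles);  /* ~38x */
        return 0;
    }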
Wire-Dominated Latency (2)
• Access latency is dominated by wire delay
  – mostly in the wordlines and bitlines/sense amps
  – PCB traces between chips
• Process technology improvements provide smaller and faster transistors
  – DRAM density doubles at about the same rate as Moore’s Law
  – DRAM latency improves very slowly because wire delay has not improved as fast as logic delay
So what do we do about it?
• Caching
  – reduces average memory instruction latency by avoiding DRAM altogether
• Limitations
  – capacity
    • programs keep increasing in size
  – compulsory misses
Faster DRAM Speed
• Clock the FSB faster
  – DRAM chips may not be able to keep up
• Latency is dominated by wire delay
  – bandwidth may be improved (DDR vs. regular) but latency doesn’t change much
    • instead of 2 cycles for a row access, it may take 3 cycles at a faster bus speed
• Doesn’t address the latency of the memory access itself
On-Chip Memory Controller

[Figure: memory controller integrated on the CPU die — everything is on the same chip, so there are no slow PCB wires to drive, and the controller can run at CPU speed instead of FSB clock speed]

Disadvantage: the memory type is now tied to the CPU implementation
Prefetching
• If memory takes a long time, start accessing it earlier

[Figure: timelines of a load going through L1, L2, and DRAM. A demand load pays the total load-to-use latency; a prefetch issued well before the load gives much-improved load-to-use latency, while a later prefetch gives somewhat-improved latency. Prefetching may cause resource contention due to the extra cache/DRAM activity]
Software Prefetching

[Figure: three versions of a control-flow graph with blocks A, B, and C, where the cache-missing load R1 = [R2] (shown in red) feeds the consumer R3 = R1+4:
  1. Baseline — the load sits right next to its consumer, so the full miss latency is exposed.
  2. Hoisted load — the load is moved up into block A; hopefully the load miss is serviced by the time we get to the consumer. But re-ordering can mess up your code: an intervening write such as R1 = R1 - 1 clobbers the loaded value.
  3. Prefetch — using a prefetch instruction (or a load to $zero), e.g. R0 = [R2], in block A can help to avoid problems with data dependencies.]
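As a concrete illustration of the prefetch-instruction variant, here is a minimal C sketch using the GCC/Clang __builtin_prefetch intrinsic (the array, loop, and DIST tuning value are made up, not from the lecture):

    #include <stddef.h>

    /* Sum an array while prefetching ahead, so the miss latency of
     * a[i + DIST] overlaps with the work on a[i]. Like the "load to
     * $zero" trick, the prefetch has no architectural side effects, so
     * there is no register dependence to get wrong. DIST is a tuning
     * knob. */
    #define DIST 16

    long sum(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/3);
            s += a[i];
        }
        return s;
    }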
Software Prefetching (2)
• Pros:
  – can leverage compiler-level information
  – no hardware modifications
• Cons:
  – prefetch instructions increase the code footprint
    • may cause more I$ misses, code alignment issues
  – hard to hoist prefetches early enough to cover main-memory latency
    • if memory is 100 cycles away, and the CPU can sustain 2 instructions per cycle, then the load needs to be moved 200 instructions earlier in the code
  – aggressive hoisting leads to many useless prefetches
    • control flow may go somewhere else (like block B in the previous slide)
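A quick worked version of the hoisting-distance arithmetic above (the per-iteration instruction count is an assumed value, just for illustration):

    #include <stdio.h>

    int main(void) {
        int mem_latency_cycles = 100;  /* the slide's DRAM latency     */
        int ipc                = 2;    /* sustained instructions/cycle */
        int insns_per_iter     = 10;   /* assumed loop-body size       */

        /* The prefetch must lead its use by latency x IPC instructions,
         * i.e., this many loop iterations in a software-pipelined loop: */
        int lead_insns = mem_latency_cycles * ipc;                       /* 200 */
        int lead_iters = (lead_insns + insns_per_iter - 1) / insns_per_iter;
        printf("prefetch %d iterations ahead\n", lead_iters);            /* 20 */
        return 0;
    }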
Hardware Prefetching

[Figure: a hardware prefetcher sitting between the CPU and DRAM — the hardware monitors miss traffic to DRAM and, depending on the prefetch algorithm and miss patterns, injects additional memory requests]

• Cannot be overly aggressive, since prefetches may contend for memory bandwidth and may pollute the cache (evict other useful cache lines)
Next-Line Prefetching
• Very simple: if a request for cache line X goes to DRAM, also request X+1
  – assumes spatial locality
    • often a good assumption
  – low chance of tying up the memory bus for too long
    • FPM DRAM will already have the correct page open for the request for X, so X+1 will likely be available in the row buffer
• Can optimize by doing Next-Line-Unless-Crossing-A-Page-Boundary prefetching
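A minimal sketch of the next-line-unless-crossing-a-page-boundary check (line and page sizes are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 64u     /* bytes per cache line -- illustrative */
    #define PAGE_SIZE 4096u   /* bytes per DRAM page  -- illustrative */

    /* On a demand miss to 'addr', decide whether to prefetch the next
     * line. The page-boundary check keeps the prefetch inside the
     * already-open row buffer. Returns false when it should be skipped. */
    bool next_line_prefetch(uint64_t addr, uint64_t *pf_addr) {
        uint64_t next = (addr & ~(uint64_t)(LINE_SIZE - 1)) + LINE_SIZE;
        if (next / PAGE_SIZE != addr / PAGE_SIZE)
            return false;                /* would cross into a new page */
        *pf_addr = next;
        return true;
    }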
Next-N-Line Prefetching
• Obvious extension
  – fetch the next N lines: X+1, X+2, …, X+N
• Need to carefully tune N
  – a larger N makes it:
    • more likely to prefetch something useful
    • more likely to evict something useful
    • more likely to stall a useful load due to bus contention
Stream Buffers

[Figure: stream buffer organization, from Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA ’90]
Stream Buffers (2)
• Can independently track multiple “intertwined” sequences/streams of accesses
• Separate buffers prevent prefetched streams from polluting the cache until a line is used at least once
  – similar effect to filter/promotion caches
• Can extend to a “quasi-sequential” stream buffer
  – add a comparator to all entries, and skip ahead (partial flush) if we hit on a non-head entry
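A minimal sketch of one stream buffer at cache-line-address granularity (depth and names are made up). On a head hit, the buffer supplies the line, the FIFO advances, and the freed slot is refilled with the next sequential line:

    #include <stdint.h>
    #include <stdbool.h>

    #define SB_DEPTH 4   /* prefetched lines per stream -- illustrative */

    /* A FIFO of sequential cache-line addresses fetched ahead of a miss
     * stream. Prefetched lines wait here, not in the cache, until the
     * demand stream actually uses them. */
    typedef struct {
        uint64_t line[SB_DEPTH];   /* line addresses, head = oldest */
        int      head, count;
    } stream_buf_t;

    bool sb_lookup(stream_buf_t *sb, uint64_t miss_line, uint64_t *refill) {
        if (sb->count > 0 && sb->line[sb->head] == miss_line) {
            uint64_t tail_next = miss_line + SB_DEPTH;   /* next sequential line */
            sb->line[sb->head] = tail_next;    /* head slot becomes new tail */
            sb->head = (sb->head + 1) % SB_DEPTH;
            *refill = tail_next;               /* issue this as a prefetch */
            return true;
        }
        return false;   /* not at the head: stream is flushed/reallocated */
    }

The quasi-sequential variant would compare miss_line against every entry rather than just the head, partially flushing the buffer on a non-head hit.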
Stride Prefetching

[Figure: column traversal of a matrix vs. its layout in linear memory]

If the array starts at address A, we are accessing the kth column, each element is B bytes large, and each row of the matrix occupies N bytes, then the addresses accessed are:

  A+Bk, A+Bk+N, A+Bk+2N, A+Bk+3N, …

Or: if you miss on address X, prefetch X+N
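The access pattern above in code — a column traversal whose successive addresses are exactly one row pitch (N bytes) apart; the matrix dimensions are illustrative:

    #include <stddef.h>

    /* Column traversal that produces the strided miss pattern above.
     * Each element is B = sizeof(int) bytes, so one row occupies
     * N = COLS * sizeof(int) bytes, and successive accesses are exactly
     * N bytes apart. */
    enum { ROWS = 1024, COLS = 1024 };

    long sum_column(int a[ROWS][COLS], size_t k) {
        long s = 0;
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][k];     /* addresses: A+Bk, A+Bk+N, A+Bk+2N, ... */
        return s;
    }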
Stride Prefetching (2)
• Like Next-N-Line prefetching, need to limit how far ahead the stride is allowed to go
  – previous example: no point in prefetching past the end of the array
• How can you tell the difference between:
  – a real stride: A[i], then A[i+1]
  – two unrelated misses: X, then Y
  – typically, only do a stride prefetch if the same stride has been observed at least a few times
Stride Prefetching (3)
• What if we’re doing Y = A + X (three arrays traversed together)?
• The miss traffic now looks like:

  A+Bk, X+Bk, Y+Bk, A+Bk+N, X+Bk+N, Y+Bk+N, A+Bk+2N, X+Bk+2N, Y+Bk+2N, …

• No detectable stride! The consecutive miss deltas are (X−A), (Y−X), (A+N−Y), repeating — the per-array stride N never appears between adjacent misses
PC-Based Stride
• Track strides per static load/store (tagged by PC) instead of over the global miss stream, so the three interleaved streams no longer confuse each other:

  Tag (PC)                      Last Addr   Stride  Count
  0x409A34  Load  R1 = 0[R2]    A+Bk+3N     N       2
  0x409A50  Load  R3 = 0[R4]    X+Bk+3N     N       2
  0x409A5C  Store R5 = 0[R6]    Y+Bk+2N     N       1

• If the same stride has been seen enough times (count > threshold), prefetch Last Addr + Stride (e.g., A+Bk+4N)
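A minimal sketch of such a PC-indexed stride table (a reference prediction table); the size, indexing, and confidence threshold are assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define RPT_SIZE    256   /* table entries -- illustrative   */
    #define CONF_THRESH 2     /* the slide's "count > threshold" */

    /* Reference-prediction-table entry, one per load/store PC. */
    typedef struct {
        uint64_t tag;         /* load/store PC            */
        uint64_t last_addr;   /* last address it accessed */
        int64_t  stride;      /* last observed delta      */
        int      count;       /* consecutive repeats      */
    } rpt_entry_t;

    static rpt_entry_t rpt[RPT_SIZE];

    /* Train on (pc, addr); returns true and sets *pf_addr once the same
     * stride has repeated often enough to risk a prefetch. */
    bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *pf_addr) {
        rpt_entry_t *e = &rpt[(pc >> 2) % RPT_SIZE];
        if (e->tag != pc) {                     /* new PC: reallocate entry */
            *e = (rpt_entry_t){ pc, addr, 0, 0 };
            return false;
        }
        int64_t delta = (int64_t)(addr - e->last_addr);
        e->count     = (delta == e->stride) ? e->count + 1 : 0;
        e->stride    = delta;
        e->last_addr = addr;
        if (e->count > CONF_THRESH) {
            *pf_addr = addr + delta;            /* e.g., A+Bk+4N */
            return true;
        }
        return false;
    }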
Other Patterns

[Figure: linked-list traversal A → B → C → D → E → F vs. the actual memory layout, where the nodes are scattered — no chance for a stride prefetcher to get this right]
Context-Sensitive Prefetching

[Figure: the linked structure from the previous slide next to a “what to prefetch next” table — each node (A, B, C, D, E) maps to the successor observed after it, while F’s successor is still unknown (?)]

• Similar to history-based branch predictors: last time I saw X, Y happened
  – Ex 1: X = taken branch, Y = not-taken
  – Ex 2: X = missed A, Y = missed B
Context-Sensitive Prefetching (2)
• Like branch predictors, a longer history enables learning more complex patterns
  – and increases the training time

[Figure: a binary tree with root A, children B and C, and leaves D, E (under B) and F, G (under C). The DFS traversal A B D B E B A C F C G C A trains a prefetch prediction table keyed on pairs of consecutive misses: (A,B)→D, (B,D)→B, (D,B)→E, (B,E)→B, (E,B)→A, (B,A)→C, (A,C)→F, …]
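A minimal sketch of a correlation table keyed on the last two misses, in the spirit of the figure; the hash, table size, and the use of 0 as an “empty” marker are all simplifications (a real design would tag entries):

    #include <stdint.h>
    #include <stdbool.h>

    #define TBL 1024   /* table entries -- illustrative */

    static uint64_t next_miss[TBL];   /* context -> predicted next miss */
    static uint64_t hist[2];          /* last two miss addresses        */

    /* On each miss: train the entry for the previous context, slide the
     * history window, then predict from the new context. */
    bool ctx_prefetch(uint64_t miss, uint64_t *pf_addr) {
        uint64_t idx = (hist[0] * 31 + hist[1]) % TBL;
        next_miss[idx] = miss;        /* "last time I saw (X,Y), Z happened" */

        hist[0] = hist[1];            /* slide the two-deep history */
        hist[1] = miss;
        idx = (hist[0] * 31 + hist[1]) % TBL;
        if (next_miss[idx] == 0)
            return false;             /* untrained context */
        *pf_addr = next_miss[idx];
        return true;
    }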
Markov Prefetching
• An alternative to explicitly remembering the patterns is to remember multiple next-states per address

[Figure: for the same tree traversal, the table records the set of misses seen after each address — A → {B, C}, B → {D, E, A}, C → {F, G, A}]
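A minimal sketch of a Markov table that remembers a few successors per miss address (sizes and the replacement policy are made up):

    #include <stdint.h>

    #define FANOUT 3     /* successors kept per address -- illustrative */
    #define MARKOV 512   /* table entries               -- illustrative */

    /* Each miss address remembers the last few next-miss addresses seen
     * after it (e.g., B -> {D, E, A}). */
    typedef struct {
        uint64_t addr;
        uint64_t next[FANOUT];
        int      n;      /* valid successors   */
        int      cur;    /* replacement cursor */
    } markov_t;

    static markov_t tbl[MARKOV];
    static uint64_t prev;   /* previous miss address */

    /* Train the previous address's successor set, then return all
     * remembered successors of 'miss' as prefetch candidates. */
    int markov_miss(uint64_t miss, uint64_t cand[FANOUT]) {
        markov_t *e = &tbl[prev % MARKOV];
        if (e->addr != prev) *e = (markov_t){ prev, {0}, 0, 0 };
        e->next[e->cur] = miss;
        e->cur = (e->cur + 1) % FANOUT;
        if (e->n < FANOUT) e->n++;
        prev = miss;

        e = &tbl[miss % MARKOV];
        if (e->addr != miss) return 0;
        for (int i = 0; i < e->n; i++) cand[i] = e->next[i];
        return e->n;     /* number of prefetch candidates */
    }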
Pointer Prefetching

[Figure: a miss goes to DRAM and the cache line comes back containing the words 1, 4128, 900120230, 900120758. Scanning for anything that looks like a pointer (is it within the heap range?): nope, nope, maybe!, maybe! — go ahead and prefetch the two “maybe” words]

struct bintree_node_t {
  int data1;
  int data2;
  struct bintree_node_t *left;
  struct bintree_node_t *right;
};

• This allows you to walk the tree (or other pointer-based data structures, which are typically hard to prefetch)
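A minimal sketch of the scan step (this style is sometimes called content-directed prefetching); the heap bounds here are made-up constants, where a real design would get them from the allocator or OS:

    #include <stdint.h>

    #define LINE_WORDS 8   /* a 64-byte line holds eight 8-byte words */

    /* Anything falling inside the heap range "looks like a pointer". */
    static const uint64_t heap_lo = 0x100000000ull;   /* assumed bounds */
    static const uint64_t heap_hi = 0x200000000ull;

    /* Scan a just-returned cache line for pointer-looking words and emit
     * them as prefetch candidates (e.g., the left/right child pointers
     * of the bintree node above). Returns the candidate count. */
    int scan_line(const uint64_t line[LINE_WORDS], uint64_t cand[LINE_WORDS]) {
        int n = 0;
        for (int i = 0; i < LINE_WORDS; i++)
            if (line[i] >= heap_lo && line[i] < heap_hi)
                cand[n++] = line[i];   /* "maybe!" -- prefetch it */
        return n;
    }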
Pointer Prefetching (2)
• Doesn’t necessarily need extra hardware to store patterns
• Prefetch speed is slower:

[Figure: a stride prefetcher can overlap the DRAM latencies of X, X+N, and X+2N, since all the addresses are known up front; pointer prefetching must serialize the DRAM latencies of A, B, and C, since each address comes out of the previous load]

See “Pointer-Cache Assisted Prefetching” by Collins et al., MICRO-2002, for reducing this serialization effect.
Value-Prediction-Based Prefetching
• Takes advantage of value locality: predict the address a load will produce, and prefetch through it
• Mispredictions are less painful
  – a normal value-prediction misprediction causes a pipeline flush
  – mispredicting an address used only for prefetching just causes spurious memory accesses

[Figure: the load PC indexes a value predictor used for addresses only; the predicted address is prefetched through L1 and L2 toward DRAM]
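A minimal sketch of a last-value predictor used only to generate prefetch addresses (table size and indexing are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define VP_SIZE 512   /* predictor entries -- illustrative */

    /* Remember the pointer value each load PC produced last time, and
     * prefetch through it. A wrong guess costs only a spurious memory
     * access, never a pipeline flush, because nothing architectural
     * consumes the prediction. */
    static uint64_t last_value[VP_SIZE];

    bool vp_predict(uint64_t load_pc, uint64_t *pf_addr) {
        uint64_t v = last_value[(load_pc >> 2) % VP_SIZE];
        if (v == 0) return false;   /* nothing learned yet */
        *pf_addr = v;               /* prefetch the predicted target */
        return true;
    }

    void vp_train(uint64_t load_pc, uint64_t loaded_value) {
        last_value[(load_pc >> 2) % VP_SIZE] = loaded_value;
    }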
Evaluating Prefetchers
• Compare against simply increasing the LLC size
  – a complex prefetcher vs. a simpler one with a slightly larger cache
• Metrics: performance, power, area, bus utilization
  – the key is balancing prefetch aggressiveness against resource utilization (reduce pollution, cache port contention, DRAM bus contention)
Where to Prefetch?
• Prefetching can be done at any level of the cache hierarchy
• The prefetching algorithm may vary as well
  – depends on why you’re having misses: capacity, conflict, or compulsory
    • prefetching may make capacity misses worse
    • a simpler technique (victim cache) may be better for conflict misses
    • prefetching has a better chance than other techniques on compulsory misses
  – behaviors vary by cache level, and I$ vs. D$