Lecture 14: DRAM and Prefetching
SRAM vs. DRAM
• DRAM = Dynamic RAM
• SRAM: 6T per bit
  – built with normal high-speed CMOS technology
• DRAM: 1T per bit
  – built with a special DRAM process optimized for density
Implementing the Capacitor
• You can use a “dead” transistor gate:
  – but this wastes area because we now have two transistors
  – and the “dummy” transistor may need to be bigger to hold enough charge
Implementing the Capacitor (2)
• There are other advanced structures

[Figure: “Trench Cell” cross-section — cell plate Si, cap insulator, storage node poly, field oxide, refilling poly, Si substrate]

DRAM figures on this slide and the previous one were taken from Prof. Nikolic’s EECS141/2003 lecture notes from UC-Berkeley
DRAM Chip Organization

[Figure: memory cell array with a row decoder (driven by the row address), sense amps feeding a row buffer, and a column decoder (driven by the column address) selecting data onto the data bus]
DRAM Chip Organization (2)
• High-level organization is very similar to SRAM
  – cells are only single-ended
    • changes the precharging and sensing circuits
    • makes reads destructive: contents are erased after reading
  – row buffer
    • read lots of bits all at once, then parcel them out based on different column addresses
    • similar to reading a full cache line, but only accessing one word at a time
• “Fast Page Mode” (FPM) DRAM organizes the DRAM row to contain the bits for a complete page
  – the row address is held constant, and then different locations in the same page can be read quickly
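A minimal sketch of how an address splits into row and column components (the 4096×1024 geometry and the function names are assumptions for illustration, not any real part’s layout):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry: 4096 rows x 1024 columns per array (made-up
     * numbers). The row bits select the wordline, and the whole row lands
     * in the row buffer; the column bits then pick data out of that
     * buffer without touching the array again. */
    #define COL_BITS 10
    #define ROW_BITS 12

    static unsigned row_of(uint32_t addr) { return (addr >> COL_BITS) & ((1u << ROW_BITS) - 1); }
    static unsigned col_of(uint32_t addr) { return addr & ((1u << COL_BITS) - 1); }

    int main(void) {
        uint32_t a = 0xABCDE;
        printf("addr 0x%x -> row %u, col %u\n", (unsigned)a, row_of(a), col_of(a));
        /* FPM: consecutive addresses that share row_of(addr) hit the open
         * row buffer -- only the column address changes. */
        return 0;
    }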
Destructive Read

[Figure: voltage waveforms for reading a stored 1 and a stored 0 — once the wordline is enabled, charge sharing nudges the precharged bitline voltage away from Vdd; the sense amp is then enabled to drive a full-swing value]

After a read of 0 or 1, the cell contains something close to ½
Refresh
• So after a read, the contents of the DRAM cell are gone
• The values are stored in the row buffer
• Write them back into the cells for the next read in the future

[Figure: sense amps/row buffer writing the values back into the DRAM cells]
Refresh (2)
• Fairly gradually, the DRAM cell will lose its contents even if it’s not accessed
  – this is why it’s called “dynamic”
  – contrast to SRAM, which is “static” in that once written, it maintains its value forever (so long as power remains on)
• All DRAM rows need to be regularly read and re-written

[Figure: gate leakage gradually draining a stored 1 toward 0]
DRAM Read Timing

[Figure: asynchronous DRAM read timing — accesses are triggered by the RAS and CAS signals, which can in theory occur at arbitrary times (subject to DRAM timing constraints)]
SDRAM Read Timing

[Figure: synchronous DRAM read timing showing the burst length; Double Data Rate (DDR) DRAM transfers data on both the rising and falling edges of the clock, but the command frequency does not change]

Timing figures taken from “A Performance Comparison of Contemporary DRAM Architectures” by Cuppu, Jacob, Davis and Mudge
More Latency

[Figure: significant wire delay just getting from the CPU to the memory controller, then more wire delay getting to the memory chips (plus the return trip); bus width and speed vary depending on memory type]
Memory Controller

[Figure: memory controller between the CPU and the DRAM banks — a read queue, write queue, and response queue face the CPU, while a scheduler buffer issues commands and data to Bank 0 and Bank 1]

Like a write-combining buffer, the scheduler may coalesce multiple accesses together, or re-order them to reduce the number of row accesses
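A toy sketch of the scheduler’s re-ordering idea — prefer requests that hit the currently open row, in the spirit of FR-FCFS scheduling. The queue depth, row size, and names are made up; no real controller is this simple:

    #include <stdint.h>
    #include <stdbool.h>

    #define ROW_SHIFT 12   /* assume 4 KB rows -- made-up geometry */
    #define QDEPTH    16

    typedef struct { uint64_t addr; bool valid; } req_t;

    static req_t    queue[QDEPTH];           /* pending requests, oldest first  */
    static uint64_t open_row = UINT64_MAX;   /* row currently in the row buffer */

    /* Prefer the oldest request that hits the open row (no precharge or
     * activate needed); otherwise fall back to the oldest request
     * overall. Returns a queue index, or -1 if the queue is empty. */
    static int pick_next(void) {
        int oldest = -1;
        for (int i = 0; i < QDEPTH; i++) {
            if (!queue[i].valid) continue;
            if ((queue[i].addr >> ROW_SHIFT) == open_row)
                return i;                    /* row-buffer hit: issue now */
            if (oldest < 0) oldest = i;
        }
        return oldest;
    }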
Wire-Dominated Latency
• CPUs
  – frequency has increased at about 60% per year
• DRAM
  – end-to-end latency has decreased only about 10% per year
• The number of cycles for a memory access keeps increasing
  – a.k.a. the “memory wall”
  – note: the absolute latency of memory is decreasing
    • just not nearly as fast as the CPU
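Back-of-the-envelope math for why the gap compounds, using the growth rates above (the 10-year horizon is just an illustration, not from the lecture):

    #include <stdio.h>

    int main(void) {
        /* Cycles per memory access = latency (seconds) x clock frequency.
         * Frequency grows ~60%/yr while latency shrinks ~10%/yr, so the
         * cycle count grows by about 1.6 * 0.9 = 1.44x per year. */
        double cycles = 1.0;                 /* normalized starting point */
        for (int yr = 1; yr <= 10; yr++)
            cycles *= 1.6 * 0.9;
        printf("after 10 years: %.0fx more cycles per access\n", cycles);  /* ~38x */
        return 0;
    }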
Wire-Dominated Latency (2)
• Access latency is dominated by wire delay
  – mostly in the wordlines and bitlines/sense amps
  – PCB traces between chips
• Process technology improvements provide smaller and faster transistors
  – DRAM density doubles at about the same rate as Moore’s Law
  – DRAM latency improves very slowly because wire delay has not improved as fast as logic delay
So what do we do about it?
• Caching
  – reduces average memory instruction latency by avoiding DRAM altogether
• Limitations
  – capacity
    • programs keep increasing in size
  – compulsory misses
Faster DRAM Speed
• Clock the FSB faster
  – DRAM chips may not be able to keep up
• Latency is dominated by wire delay
  – bandwidth may be improved (DDR vs. regular) but latency doesn’t change much
    • instead of 2 cycles for a row access, it may take 3 cycles at a faster bus speed
• Doesn’t address the latency of the memory access itself
On-Chip Memory Controller

[Figure: memory controller integrated on the CPU die — everything is on the same chip, so there are no slow PCB wires to drive, and the controller can run at CPU speed instead of FSB clock speed]

Disadvantage: the memory type is now tied to the CPU implementation
Prefetching
• If memory takes a long time, start accessing it earlier

[Figure: timelines of a load going through L1, L2, and DRAM. A demand load pays the total load-to-use latency; a prefetch issued well before the load gives much-improved load-to-use latency, while a later prefetch gives somewhat-improved latency. Prefetching may cause resource contention due to the extra cache/DRAM activity]
Software Prefetching

[Figure: three versions of a control-flow graph with blocks A, B, and C, where the cache-missing load R1 = [R2] (shown in red) feeds the consumer R3 = R1+4:
  1. Baseline — the load sits right next to its consumer, so the full miss latency is exposed.
  2. Hoisted load — the load is moved up into block A; hopefully the load miss is serviced by the time we get to the consumer. But re-ordering can mess up your code: an intervening write such as R1 = R1 - 1 clobbers the loaded value.
  3. Prefetch — using a prefetch instruction (or a load to $zero), e.g. R0 = [R2], in block A can help to avoid problems with data dependencies.]
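As a concrete illustration of the prefetch-instruction variant, here is a minimal C sketch using the GCC/Clang __builtin_prefetch intrinsic (the array, loop, and DIST tuning value are made up, not from the lecture):

    #include <stddef.h>

    /* Sum an array while prefetching ahead, so the miss latency of
     * a[i + DIST] overlaps with the work on a[i]. Like the "load to
     * $zero" trick, the prefetch has no architectural side effects, so
     * there is no register dependence to get wrong. DIST is a tuning
     * knob. */
    #define DIST 16

    long sum(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/3);
            s += a[i];
        }
        return s;
    }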
Software Prefetching (2)
• Pros:
  – can leverage compiler-level information
  – no hardware modifications
• Cons:
  – prefetch instructions increase the code footprint
    • may cause more I$ misses, code alignment issues
  – hard to hoist prefetches early enough to cover main-memory latency
    • if memory is 100 cycles away, and the CPU can sustain 2 instructions per cycle, then the load needs to be moved 200 instructions earlier in the code
  – aggressive hoisting leads to many useless prefetches
    • control flow may go somewhere else (like block B in the previous slide)
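A quick worked version of the hoisting-distance arithmetic above (the per-iteration instruction count is an assumed value, just for illustration):

    #include <stdio.h>

    int main(void) {
        int mem_latency_cycles = 100;  /* the slide's DRAM latency     */
        int ipc                = 2;    /* sustained instructions/cycle */
        int insns_per_iter     = 10;   /* assumed loop-body size       */

        /* The prefetch must lead its use by latency x IPC instructions,
         * i.e., this many loop iterations in a software-pipelined loop: */
        int lead_insns = mem_latency_cycles * ipc;                       /* 200 */
        int lead_iters = (lead_insns + insns_per_iter - 1) / insns_per_iter;
        printf("prefetch %d iterations ahead\n", lead_iters);            /* 20 */
        return 0;
    }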
Hardware Prefetching

[Figure: a hardware prefetcher sitting between the CPU and DRAM — the hardware monitors miss traffic to DRAM and, depending on the prefetch algorithm and miss patterns, injects additional memory requests]

• Cannot be overly aggressive, since prefetches may contend for memory bandwidth and may pollute the cache (evict other useful cache lines)
Next-Line Prefetching
• Very simple: if a request for cache line X goes to DRAM, also request X+1
  – assumes spatial locality
    • often a good assumption
  – low chance of tying up the memory bus for too long
    • FPM DRAM will already have the correct page open for the request for X, so X+1 will likely be available in the row buffer
• Can optimize by doing Next-Line-Unless-Crossing-A-Page-Boundary prefetching
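A minimal sketch of the next-line-unless-crossing-a-page-boundary check (line and page sizes are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_SIZE 64u     /* bytes per cache line -- illustrative */
    #define PAGE_SIZE 4096u   /* bytes per DRAM page  -- illustrative */

    /* On a demand miss to 'addr', decide whether to prefetch the next
     * line. The page-boundary check keeps the prefetch inside the
     * already-open row buffer. Returns false when it should be skipped. */
    bool next_line_prefetch(uint64_t addr, uint64_t *pf_addr) {
        uint64_t next = (addr & ~(uint64_t)(LINE_SIZE - 1)) + LINE_SIZE;
        if (next / PAGE_SIZE != addr / PAGE_SIZE)
            return false;                /* would cross into a new page */
        *pf_addr = next;
        return true;
    }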
Next-N-Line Prefetching
• Obvious extension
  – fetch the next N lines: X+1, X+2, …, X+N
• Need to carefully tune N
  – a larger N makes it:
    • more likely to prefetch something useful
    • more likely to evict something useful
    • more likely to stall a useful load due to bus contention
Stream Buffers

[Figure: stream buffer organization, from Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA ’90]
Stream Buffers (2)
• Can independently track multiple “intertwined” sequences/streams of accesses
• Separate buffers prevent prefetched streams from polluting the cache until a line is used at least once
  – similar effect to filter/promotion caches
• Can extend to a “quasi-sequential” stream buffer
  – add a comparator to all entries, and skip ahead (partial flush) if we hit on a non-head entry
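A minimal sketch of one stream buffer at cache-line-address granularity (depth and names are made up). On a head hit, the buffer supplies the line, the FIFO advances, and the freed slot is refilled with the next sequential line:

    #include <stdint.h>
    #include <stdbool.h>

    #define SB_DEPTH 4   /* prefetched lines per stream -- illustrative */

    /* A FIFO of sequential cache-line addresses fetched ahead of a miss
     * stream. Prefetched lines wait here, not in the cache, until the
     * demand stream actually uses them. */
    typedef struct {
        uint64_t line[SB_DEPTH];   /* line addresses, head = oldest */
        int      head, count;
    } stream_buf_t;

    bool sb_lookup(stream_buf_t *sb, uint64_t miss_line, uint64_t *refill) {
        if (sb->count > 0 && sb->line[sb->head] == miss_line) {
            uint64_t tail_next = miss_line + SB_DEPTH;   /* next sequential line */
            sb->line[sb->head] = tail_next;    /* head slot becomes new tail */
            sb->head = (sb->head + 1) % SB_DEPTH;
            *refill = tail_next;               /* issue this as a prefetch */
            return true;
        }
        return false;   /* not at the head: stream is flushed/reallocated */
    }

The quasi-sequential variant would compare miss_line against every entry rather than just the head, partially flushing the buffer on a non-head hit.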
Stride Prefetching

[Figure: column traversal of a matrix vs. its layout in linear memory]

If the array starts at address A, we are accessing the kth column, each element is B bytes large, and each row of the matrix occupies N bytes, then the addresses accessed are:

  A+Bk, A+Bk+N, A+Bk+2N, A+Bk+3N, …

Or: if you miss on address X, prefetch X+N
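The access pattern above in code — a column traversal whose successive addresses are exactly one row pitch (N bytes) apart; the matrix dimensions are illustrative:

    #include <stddef.h>

    /* Column traversal that produces the strided miss pattern above.
     * Each element is B = sizeof(int) bytes, so one row occupies
     * N = COLS * sizeof(int) bytes, and successive accesses are exactly
     * N bytes apart. */
    enum { ROWS = 1024, COLS = 1024 };

    long sum_column(int a[ROWS][COLS], size_t k) {
        long s = 0;
        for (size_t i = 0; i < ROWS; i++)
            s += a[i][k];     /* addresses: A+Bk, A+Bk+N, A+Bk+2N, ... */
        return s;
    }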
Stride Prefetching (2)
• Like Next-N-Line prefetching, need to limit how far ahead the stride is allowed to go
  – previous example: no point in prefetching past the end of the array
• How can you tell the difference between:
  – a real stride: A[i], then A[i+1]
  – two unrelated misses: X, then Y
  – typically, only do a stride prefetch if the same stride has been observed at least a few times
Stride Prefetching (3)
• What if we’re doing Y = A + X (three arrays traversed together)?
• The miss traffic now looks like:

  A+Bk, X+Bk, Y+Bk, A+Bk+N, X+Bk+N, Y+Bk+N, A+Bk+2N, X+Bk+2N, Y+Bk+2N, …

• No detectable stride! The consecutive miss deltas are (X−A), (Y−X), (A+N−Y), repeating — the per-array stride N never appears between adjacent misses
PC-Based Stride
• Track strides per static load/store (tagged by PC) instead of over the global miss stream, so the three interleaved streams no longer confuse each other:

  Tag (PC)                      Last Addr   Stride  Count
  0x409A34  Load  R1 = 0[R2]    A+Bk+3N     N       2
  0x409A50  Load  R3 = 0[R4]    X+Bk+3N     N       2
  0x409A5C  Store R5 = 0[R6]    Y+Bk+2N     N       1

• If the same stride has been seen enough times (count > threshold), prefetch Last Addr + Stride (e.g., A+Bk+4N)
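A minimal sketch of such a PC-indexed stride table (a reference prediction table); the size, indexing, and confidence threshold are assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define RPT_SIZE    256   /* table entries -- illustrative   */
    #define CONF_THRESH 2     /* the slide's "count > threshold" */

    /* Reference-prediction-table entry, one per load/store PC. */
    typedef struct {
        uint64_t tag;         /* load/store PC            */
        uint64_t last_addr;   /* last address it accessed */
        int64_t  stride;      /* last observed delta      */
        int      count;       /* consecutive repeats      */
    } rpt_entry_t;

    static rpt_entry_t rpt[RPT_SIZE];

    /* Train on (pc, addr); returns true and sets *pf_addr once the same
     * stride has repeated often enough to risk a prefetch. */
    bool rpt_access(uint64_t pc, uint64_t addr, uint64_t *pf_addr) {
        rpt_entry_t *e = &rpt[(pc >> 2) % RPT_SIZE];
        if (e->tag != pc) {                     /* new PC: reallocate entry */
            *e = (rpt_entry_t){ pc, addr, 0, 0 };
            return false;
        }
        int64_t delta = (int64_t)(addr - e->last_addr);
        e->count     = (delta == e->stride) ? e->count + 1 : 0;
        e->stride    = delta;
        e->last_addr = addr;
        if (e->count > CONF_THRESH) {
            *pf_addr = addr + delta;            /* e.g., A+Bk+4N */
            return true;
        }
        return false;
    }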
Other Patterns

[Figure: linked-list traversal A → B → C → D → E → F vs. the actual memory layout, where the nodes are scattered — no chance for a stride prefetcher to get this right]
Context-Sensitive Prefetching

[Figure: the linked structure from the previous slide next to a “what to prefetch next” table — each node (A, B, C, D, E) maps to the successor observed after it, while F’s successor is still unknown (?)]

• Similar to history-based branch predictors: last time I saw X, Y happened
  – Ex 1: X = taken branch, Y = not-taken
  – Ex 2: X = missed A, Y = missed B
Context-Sensitive Prefetching (2)
• Like branch predictors, a longer history enables learning more complex patterns
  – and increases the training time

[Figure: a binary tree with root A, children B and C, and leaves D, E (under B) and F, G (under C). The DFS traversal A B D B E B A C F C G C A trains a prefetch prediction table keyed on pairs of consecutive misses: (A,B)→D, (B,D)→B, (D,B)→E, (B,E)→B, (E,B)→A, (B,A)→C, (A,C)→F, …]
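A minimal sketch of a correlation table keyed on the last two misses, in the spirit of the figure; the hash, table size, and the use of 0 as an “empty” marker are all simplifications (a real design would tag entries):

    #include <stdint.h>
    #include <stdbool.h>

    #define TBL 1024   /* table entries -- illustrative */

    static uint64_t next_miss[TBL];   /* context -> predicted next miss */
    static uint64_t hist[2];          /* last two miss addresses        */

    /* On each miss: train the entry for the previous context, slide the
     * history window, then predict from the new context. */
    bool ctx_prefetch(uint64_t miss, uint64_t *pf_addr) {
        uint64_t idx = (hist[0] * 31 + hist[1]) % TBL;
        next_miss[idx] = miss;        /* "last time I saw (X,Y), Z happened" */

        hist[0] = hist[1];            /* slide the two-deep history */
        hist[1] = miss;
        idx = (hist[0] * 31 + hist[1]) % TBL;
        if (next_miss[idx] == 0)
            return false;             /* untrained context */
        *pf_addr = next_miss[idx];
        return true;
    }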
Markov Prefetching
• An alternative to explicitly remembering the patterns is to remember multiple next-states per address

[Figure: for the same tree traversal, the table records the set of misses seen after each address — A → {B, C}, B → {D, E, A}, C → {F, G, A}]
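A minimal sketch of a Markov table that remembers a few successors per miss address (sizes and the replacement policy are made up):

    #include <stdint.h>

    #define FANOUT 3     /* successors kept per address -- illustrative */
    #define MARKOV 512   /* table entries               -- illustrative */

    /* Each miss address remembers the last few next-miss addresses seen
     * after it (e.g., B -> {D, E, A}). */
    typedef struct {
        uint64_t addr;
        uint64_t next[FANOUT];
        int      n;      /* valid successors   */
        int      cur;    /* replacement cursor */
    } markov_t;

    static markov_t tbl[MARKOV];
    static uint64_t prev;   /* previous miss address */

    /* Train the previous address's successor set, then return all
     * remembered successors of 'miss' as prefetch candidates. */
    int markov_miss(uint64_t miss, uint64_t cand[FANOUT]) {
        markov_t *e = &tbl[prev % MARKOV];
        if (e->addr != prev) *e = (markov_t){ prev, {0}, 0, 0 };
        e->next[e->cur] = miss;
        e->cur = (e->cur + 1) % FANOUT;
        if (e->n < FANOUT) e->n++;
        prev = miss;

        e = &tbl[miss % MARKOV];
        if (e->addr != miss) return 0;
        for (int i = 0; i < e->n; i++) cand[i] = e->next[i];
        return e->n;     /* number of prefetch candidates */
    }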
Pointer Prefetching

[Figure: a miss goes to DRAM and the cache line comes back containing the words 1, 4128, 900120230, 900120758. Scanning for anything that looks like a pointer (is it within the heap range?): nope, nope, maybe!, maybe! — go ahead and prefetch the two “maybe” words]

struct bintree_node_t {
  int data1;
  int data2;
  struct bintree_node_t *left;
  struct bintree_node_t *right;
};

• This allows you to walk the tree (or other pointer-based data structures, which are typically hard to prefetch)
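A minimal sketch of the scan step (this style is sometimes called content-directed prefetching); the heap bounds here are made-up constants, where a real design would get them from the allocator or OS:

    #include <stdint.h>

    #define LINE_WORDS 8   /* a 64-byte line holds eight 8-byte words */

    /* Anything falling inside the heap range "looks like a pointer". */
    static const uint64_t heap_lo = 0x100000000ull;   /* assumed bounds */
    static const uint64_t heap_hi = 0x200000000ull;

    /* Scan a just-returned cache line for pointer-looking words and emit
     * them as prefetch candidates (e.g., the left/right child pointers
     * of the bintree node above). Returns the candidate count. */
    int scan_line(const uint64_t line[LINE_WORDS], uint64_t cand[LINE_WORDS]) {
        int n = 0;
        for (int i = 0; i < LINE_WORDS; i++)
            if (line[i] >= heap_lo && line[i] < heap_hi)
                cand[n++] = line[i];   /* "maybe!" -- prefetch it */
        return n;
    }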
Pointer Prefetching (2)
• Doesn’t necessarily need extra hardware to store patterns
• Prefetch speed is slower:

[Figure: a stride prefetcher can overlap the DRAM latencies of X, X+N, and X+2N, since all the addresses are known up front; pointer prefetching must serialize the DRAM latencies of A, B, and C, since each address comes out of the previous load]

See “Pointer-Cache Assisted Prefetching” by Collins et al., MICRO-2002, for reducing this serialization effect.
Value-Prediction-Based Prefetching
• Takes advantage of value locality: predict the address a load will produce, and prefetch through it
• Mispredictions are less painful
  – a normal value-prediction misprediction causes a pipeline flush
  – mispredicting an address used only for prefetching just causes spurious memory accesses

[Figure: the load PC indexes a value predictor used for addresses only; the predicted address is prefetched through L1 and L2 toward DRAM]
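A minimal sketch of a last-value predictor used only to generate prefetch addresses (table size and indexing are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define VP_SIZE 512   /* predictor entries -- illustrative */

    /* Remember the pointer value each load PC produced last time, and
     * prefetch through it. A wrong guess costs only a spurious memory
     * access, never a pipeline flush, because nothing architectural
     * consumes the prediction. */
    static uint64_t last_value[VP_SIZE];

    bool vp_predict(uint64_t load_pc, uint64_t *pf_addr) {
        uint64_t v = last_value[(load_pc >> 2) % VP_SIZE];
        if (v == 0) return false;   /* nothing learned yet */
        *pf_addr = v;               /* prefetch the predicted target */
        return true;
    }

    void vp_train(uint64_t load_pc, uint64_t loaded_value) {
        last_value[(load_pc >> 2) % VP_SIZE] = loaded_value;
    }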
Evaluating Prefetchers
• Compare against simply increasing the LLC size
  – a complex prefetcher vs. a simpler one with a slightly larger cache
• Metrics: performance, power, area, bus utilization
  – the key is balancing prefetch aggressiveness against resource utilization (reduce pollution, cache port contention, DRAM bus contention)
Where to Prefetch?
• Prefetching can be done at any level of the cache hierarchy
• The prefetching algorithm may vary as well
  – depends on why you’re having misses: capacity, conflict, or compulsory
    • prefetching may make capacity misses worse
    • a simpler technique (victim cache) may be better for conflict misses
    • prefetching has a better chance than other techniques on compulsory misses
  – behaviors vary by cache level, and I$ vs. D$