Eager Writeback - A Technique for Improving Bandwidth Utilization
Hsien-Hsin Lee, Gary Tyson, Matt Farrens
Intel Corporation, Santa Clara
University of Michigan, Ann Arbor
University of California, Davis
Hsien-Hsin Lee MICRO-33
Agenda
Introduction
Memory Type and Bandwidth Issues
Memory Reference Characterization
Eager Writeback
Experimental Results and Analysis
Conclusions
Modern Multimedia Computing System
[Figure: system block diagram. The host processor (core processor and L2 cache, joined by the back-side bus) attaches through the front-side bus to the chipset, which also connects system memory (DRAM), I/O devices, and, over A.G.P., the graphics processing unit with its local frame buffer. Command and texture traffic (commands, texture data) is marked on the diagram.]
Memory Type Support
Page-based programmable memory types
– Uncacheable (e.g. memory-mapped I/O)
– Write-Combining (e.g. frame buffers)
– Write-Protected (e.g. copy-on-write on fork)
– Write-Through
– Write-Back or Copy-Back
Write-through vs. Writeback
[Figure: two diagrams, each showing CPU, L1$, and main memory. Write-through panel labels: allocate, writes, reads. Writeback panel labels: allocate, writes, dirty writes, reads.]
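The difference in memory-bound write traffic between the two policies can be sketched in a few lines. This is a toy model, not the simulator used in the talk: it assumes an unbounded cache that is flushed once at the end, and the function name `memory_writes` and the addresses are hypothetical.

```python
def memory_writes(stores, policy):
    """Count writes that reach main memory for a list of stored line addresses.

    Toy assumption: the cache never evicts during the run and is flushed once
    at the end, so writeback sends each dirty line to memory exactly once.
    """
    if policy == "write-through":
        # Every store is forwarded to main memory.
        return len(stores)
    if policy == "writeback":
        # Stores only dirty the cached line; each dirty line is written once.
        return len(set(stores))
    raise ValueError(f"unknown policy: {policy}")


# Three stores to line 0x40 and one store to line 0x80 (hypothetical addresses).
stores = [0x40, 0x40, 0x40, 0x80]
wt_writes = memory_writes(stores, "write-through")
wb_writes = memory_writes(stores, "writeback")
```

Writeback saves traffic overall, but its dirty-line writes are deferred until eviction, which is where they can collide with demand fetches.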
Potential WB Bandwidth Issues
Conflict on the bus while streaming data in
– Incoming: demand fetches
– Outgoing: dirty data
Dirty data
– Can steal cycles amid successive data streaming
– Delays data delivery on the critical path
– Writeback (castout) buffer could be ineffective
How to alleviate the conflicts?
– Try to find a balance between WT and WB
Probability of Rewrites to Dirty Lines
[Figure: two bar charts, one for the L1 data cache and one for the L2 cache. Categories: LRU state (MRU, MRU-1, LRU+1, LRU); value axis: probability from 0 to 1. Series: Xlock-mount, POV-ray, xdoom, Xanim, Average.]
4-way caches using x-benchmark [Austin 98]
Pr(R|D) = (# re-dirty) / (# dirty lines entering a particular LRU state)
MRU lines are much more likely to be written
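The ratio above can be computed per LRU state from two event counters. A minimal sketch follows; the function name and the counter values are hypothetical placeholders chosen only to exercise the formula, not measurements from the talk.

```python
LRU_STATES = ("MRU", "MRU-1", "LRU+1", "LRU")  # 4-way cache, most to least recent


def rewrite_probability(re_dirty, dirty_entering):
    """Pr(R|D) for one LRU state: re-dirty events over dirty lines entering it."""
    if dirty_entering == 0:
        return 0.0  # no dirty line ever entered this state
    return re_dirty / dirty_entering


# Hypothetical counters: (re-dirty events, dirty lines entering the state).
counters = {
    "MRU": (30, 40),
    "MRU-1": (10, 40),
    "LRU+1": (4, 40),
    "LRU": (1, 40),
}
pr = {state: rewrite_probability(*counters[state]) for state in LRU_STATES}
```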
Normalized L1 Dirty Line States
Enter dirty: the first time a line is written
Re-dirty: writing to a dirty line
[Figure: bar chart normalized from 0.00 to 1.00, one bar per benchmark (xlock, pov-ray, xdoom, xanim). Series: # enter dirty and # re-dirty for each LRU state (MRU, MRU-1, LRU+1, LRU).]
Eager Writeback Trigger
Dirty lines enter LRU state!
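The trigger condition can be illustrated on a single cache set. The sketch below is an assumption-laden toy, not the authors' hardware: it models hits only, keeps the set as a list ordered from MRU to LRU, and the class names `Line` and `CacheSet` are hypothetical.

```python
class Line:
    def __init__(self, tag):
        self.tag = tag
        self.dirty = False


class CacheSet:
    """One set of a set-associative cache; index 0 is MRU, the last index is LRU."""

    def __init__(self, tags):
        self.lines = [Line(tag) for tag in tags]

    def access(self, tag, is_write=False):
        """Hit on a resident line. Returns the tag written back eagerly, or None."""
        idx = next(i for i, line in enumerate(self.lines) if line.tag == tag)
        was_lru = idx == len(self.lines) - 1
        line = self.lines.pop(idx)
        line.dirty = line.dirty or is_write
        self.lines.insert(0, line)  # promote the accessed line to MRU
        new_lru = self.lines[-1]
        # Promoting the old LRU line pushes another line into the LRU state.
        # If that line is dirty, write it back now and mark it clean.
        if was_lru and new_lru is not line and new_lru.dirty:
            new_lru.dirty = False
            return new_lru.tag
        return None


s = CacheSet(["A", "B", "C", "D"])   # A is MRU, D is LRU
r1 = s.access("C", is_write=True)    # order C A B D, LRU unchanged
r2 = s.access("D")                   # order D C A B, new LRU B is clean
r3 = s.access("B")                   # order B D C A, new LRU A is clean
r4 = s.access("A")                   # order A B D C, dirty C enters LRU
```

The line stays in the cache after the eager writeback; only its dirty bit is cleared, so a later eviction needs no bus write.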
Eager Writeback Mechanism
[Figure: a set-associative cache (set0, way0, per-set LRU bits, shown as 01) beside an MSHR and a writeback buffer, each holding block address and data fields and connected to the next-level cache/memory. Other labels: cache miss address, data return, data forward path.]
[Figure: three further animation frames of the same diagram, with the LRU bits reading 00; the last frame adds an "X" mark.]
Eager Writeback Mechanism
[Figure: the same diagram extended with an Eager Queue (EQ) that holds set IDs. Labels: set IDs, set ID, trigger when entry freed.]
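One way to read the Eager Queue frame is sketched below. It assumes that an eager writeback is deferred when the writeback buffer is full, that the EQ remembers the set ID, and that "entry freed" refers to a writeback buffer entry draining to the next level. The class and method names are hypothetical, and the slides do not specify these details.

```python
from collections import deque


class EagerWriteback:
    def __init__(self, wb_entries, eq_entries):
        self.wb_entries = wb_entries
        self.eq_entries = eq_entries
        self.wb_buffer = deque()    # block addresses waiting for the bus
        self.eager_queue = deque()  # set IDs whose eager writeback was deferred
        self.dirty_lru = {}         # set ID -> block address of its dirty LRU line

    def dirty_line_enters_lru(self, set_id, block_addr):
        """Trigger: try to write back at once, otherwise remember the set ID."""
        self.dirty_lru[set_id] = block_addr
        if len(self.wb_buffer) < self.wb_entries:
            self._eager_write(set_id)
        elif len(self.eager_queue) < self.eq_entries and set_id not in self.eager_queue:
            self.eager_queue.append(set_id)

    def _eager_write(self, set_id):
        # The LRU line may have been rewritten or evicted since it was queued.
        block = self.dirty_lru.pop(set_id, None)
        if block is not None:
            self.wb_buffer.append(block)

    def wb_entry_freed(self):
        """Drain the oldest writeback, then retry deferred sets while space remains."""
        drained = self.wb_buffer.popleft() if self.wb_buffer else None
        while self.eager_queue and len(self.wb_buffer) < self.wb_entries:
            self._eager_write(self.eager_queue.popleft())
        return drained


ew = EagerWriteback(wb_entries=1, eq_entries=4)
ew.dirty_line_enters_lru(0, 0x100)   # buffer has room: written back eagerly
ew.dirty_line_enters_lru(5, 0x540)   # buffer full: set 5 waits in the EQ
queued = list(ew.eager_queue)
drained = ew.wb_entry_freed()        # 0x100 leaves, set 5 is retried
```

With `eq_entries=0` the queue never accepts an entry, so a trigger that finds the buffer full is simply dropped in this model.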
Simulation Framework
SimpleScalar suite
8-wide OOO superscalar machine
Enhanced memory subsystem modeling
Non-blocking caches (32 KB L1 / 512 KB L2)
– Model MSHRs for all cache levels
– Model WC memory type
2-level Gshare (10-bit) branch predictor
RDRAM model (single-channel)
Model limited bus bandwidth
– Peak front-side bus bandwidth = 1.6 GB/s
Simulation Framework
Parameter          Spec
Core frequency     1 GHz
BSB speed          500 MHz
FSB speed          200 MHz
LSQ size           32
RUU size           64
1st level caches   3 clks / 1 clk
Cache ports        2
2nd level cache    18 clks / 10 clks
TLBs               2 clks / 1 clk
BSB arbitration    4 clks
FSB arbitration    10 clks
RDRAM Trcd         20 clks
RDRAM Tcac         20 clks
RDRAM Trp          20 clks
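The 1.6 GB/s peak and the clock rates above imply how many bytes the front-side bus moves per clock. A minimal sketch of that arithmetic, assuming the peak is reached at the 200 MHz FSB clock. The 32-byte transfer in the usage line is a hypothetical size, not a value from the slides, and arbitration and DRAM latencies are ignored.

```python
CORE_HZ = 1_000_000_000               # core frequency, 1 GHz
FSB_HZ = 200_000_000                  # front-side bus speed, 200 MHz
FSB_PEAK_BYTES_PER_S = 1_600_000_000  # peak FSB bandwidth, 1.6 GB/s

# Bytes moved per FSB clock at peak, and core clocks per FSB clock.
bytes_per_fsb_clock = FSB_PEAK_BYTES_PER_S // FSB_HZ
core_clocks_per_fsb_clock = CORE_HZ // FSB_HZ


def transfer_core_clocks(nbytes):
    """Core clocks the bus is busy moving nbytes at peak rate (data phase only)."""
    fsb_clocks = -(-nbytes // bytes_per_fsb_clock)  # ceiling division
    return fsb_clocks * core_clocks_per_fsb_clock


busy = transfer_core_clocks(32)  # hypothetical 32-byte transfer
```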
Case Studies
3D Geometry Engine
– A triangle-based rendering algorithm
– Used in Microsoft Direct3D and SGI OpenGL
Streaming
[Figure: geometry engine data flow. Labels: 3D model, driver, driver buffer, geom engine (Xform, Light), to AGP memory.]
Bandwidth Shifting (Geometry Engine)
[Figure: bandwidth (0 to 1.6 GB/s) over execution time, in two panels: Baseline Writeback (annotation: 0.6 GB/s) and Eager Writeback (annotation: 0.4 GB/s).]
Load Response Time
[Figure: plot with axis labels "Execution time" and "Vertex ID" and ticks from 0 to 1e+06. Series: Eager Writeback, Baseline Writeback. Annotation: e.g. 600kth load.]
Performance of Geometry Engine
Free writeback represents performance upper bound
[Figure: bar chart, value axis from 0.900 to 1.200. Groups: NL, wb = 1; NL, wb = 4; NL, wb = 256; 3DL, wb = 1; 3DL, wb = 4; 3DL, wb = 256. Series: Baseline, EQ = 0, EQ = 4, free dirty WB.]
Bandwidth Filling (Streaming)
[Figure: bandwidth (0 to 1.6 GB/s) over execution time, in two panels: Eager Writeback and Baseline Writeback.]
Performance of Streaming Benchmark
[Figure: bar chart, value axis from 0.90 to 1.15. Groups: Stream wb = 1, Stream wb = 4, Stream wb = 256. Series: Baseline, EQ = 0, EQ = 4, free dirty WB.]
Conclusions
Writebacks compete with demand misses for bandwidth
Demand data delivery can be deferred
LRU dirty lines are rarely promoted again
Eager writeback
– Triggered by dirty lines entering LRU state
– Additional programmable memory type
– Shifts writeback traffic
– Effective for content-rich apps, e.g. 3D geometry
– Can be extended for
  • Reducing the context switch penalty
  • Reducing coherency miss latencies for MP systems (similar technique: LTP [LaiFalsafi 00])
Questions & Answers
"Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed; you cannot bribe God."
David Clark, MIT
That's all, folks!!!
http://www.eecs.umich.edu/~linear
Backup Foils
Speedup with Traffic Injection
Imitating bandwidth stealing from other bus agents
Uniform memory traffic injection
[Figure: bar chart of speedup, value axis from 0.90 to 1.20. Series: 0 B/s (no injection); 400 MB/s (160 B/400 clks); 800 MB/s (320 B/400 clks); 1.2 GB/s (480 B/400 clks); 400 MB/s (1280 B/3200 clks); 800 MB/s (2560 B/3200 clks); 1.2 GB/s (3840 B/3200 clks).]
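The injection labels pair each average rate with a burst size and a period. The arithmetic works out if "clks" are 1 GHz core clocks, which is an assumption here since the slide does not say which clock is meant. The function name is hypothetical.

```python
CORE_HZ = 1_000_000_000  # assumed clock for "clks": the 1 GHz core clock


def injected_bandwidth(bytes_per_burst, period_clks, clock_hz=CORE_HZ):
    """Average injected traffic in bytes/s for one burst every period_clks clocks."""
    return bytes_per_burst * clock_hz // period_clks


# (bytes per burst, period in clocks) pairs taken from the chart labels.
fine_grain = [injected_bandwidth(b, 400) for b in (160, 320, 480)]
coarse_grain = [injected_bandwidth(b, 3200) for b in (1280, 2560, 3840)]
```

Both period choices give the same three average rates; they differ only in burst length, 400 versus 3200 clocks between injections.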
Injected Memory Traffic (0.8 GB/s)
[Figure: bandwidth (0 to 1.6 GB/s) over execution time, in two panels: injection of 320 B every 400 clks and injection of 2560 B every 3200 clks.]