Optimizing Communication and Capacity in 3D Stacked Cache Hierarchies
Aniruddha Udipi
N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S.
Makineni and D. Newell
University of Utah and Intel STL
University of Utah 2
Motivation
• Many-core designs require large cache capacity for performance
• SRAM has low latency and consumes less power
• DRAM has 8X density but poor latency/power characteristics
• Can we design a hybrid SRAM-DRAM cache to take advantage of both technologies?
• Can we build a customized on-chip network specifically targeted at such a design?
Proposal - 3D Stacked Hybrid Cache
SRAM DRAM
• Not an option in conventional 2D design
• 3D Mixed-process stacking enables a single vertical SRAM/DRAM bank
Executive Summary
• 3D stacked hybrid cache design
• Synergistic proposals to improve performance and power efficiency
– Optimizing Capacity
  • Reconfigurable cache hierarchy
– Optimizing Communication
  • Page coloring for effective data placement - reduced communication
  • Tailor-made on-chip interconnection network - quicker communication
• Up to 62% performance increase
Outline
• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions
3D Technology
+ Mixed process integration possible
+ High speed vertical interconnects
- Thermal Issues
[Figure: cross-section of a two-die 3D stack. Labels: heat sink, Die #1 (bulk Si #1, active Si #1, metal), die-to-die vias, Die #2 (metal, active Si #2, bulk Si #2), through-silicon vias (TSVs), I/O bumps. Source: Black et al., MICRO ’06]
Baseline Model
Lower die - 16 Processing cores
Upper die - 16 SRAM Banks, with grid based on-chip network
Outline
• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions
Technique I - Reconfigurable hierarchy
• Increase capacity by stacking a DRAM bank on each SRAM cache bank, reconfigure bank size based on demand
• More compelling with 3D and NUCA
– Spare capacity on die 3 does not intrude on the layout of the second die or steal capacity from neighboring caches
– Cache is already partitioned into NUCA banks, so additional banks do not complicate the logic much
– Access time grows less than linearly with capacity
– Dramatic increase in capacity, no gradation, only two choices
• Turn-off DRAM for small working set size
Proposed Reconfigurable Cache Model
Die containing 16 cores
Die containing 16 SRAM banks and tree interconnect
Die containing 16 DRAM banks and no interconnect
Inter-die via pillar to send request from core to L2 SRAM (not shown: one pillar for each core)
Inter-die via pillar to access portion of L2 in DRAM (not shown: one pillar per sector)
Proposed Reconfiguration Policy
• Simple heuristic for enabling/disabling DRAM bank: every reconfiguration interval,
– If usage is low and cache-bank miss rate is low, disable the DRAM bank above
– If usage is high and cache-bank miss rate is high, enable the DRAM bank above
• Reconfiguration interval is 10 million cycles
• All cores are stalled for 100K cycles during reconfiguration
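The heuristic above can be sketched in a few lines of Python. This is only an illustration: the slides say "low" and "high" without giving thresholds, so the threshold constants, the `BankStats` type, the definition of `usage`, and the choice to keep the current state when neither rule fires are all assumptions.

```python
from dataclasses import dataclass

RECONFIG_INTERVAL_CYCLES = 10_000_000   # from the slide
STALL_CYCLES = 100_000                  # from the slide

# Hypothetical thresholds: the slides only say "low" and "high".
LOW_USAGE, HIGH_USAGE = 0.25, 0.75
LOW_MISS_RATE, HIGH_MISS_RATE = 0.05, 0.20


@dataclass
class BankStats:
    usage: float        # assumed metric: fraction of the bank in use last interval
    miss_rate: float    # misses / accesses for this bank last interval
    dram_enabled: bool  # current state of the DRAM bank stacked above


def next_dram_state(bank: BankStats) -> bool:
    """Decide whether the DRAM bank above is enabled for the next interval."""
    if bank.usage <= LOW_USAGE and bank.miss_rate <= LOW_MISS_RATE:
        return False            # low usage, low miss rate: disable
    if bank.usage >= HIGH_USAGE and bank.miss_rate >= HIGH_MISS_RATE:
        return True             # high usage, high miss rate: enable
    return bank.dram_enabled    # assumption: otherwise leave unchanged


# Upper bound on time lost to stalls if every interval triggers a reconfiguration.
WORST_CASE_STALL_FRACTION = STALL_CYCLES / RECONFIG_INTERVAL_CYCLES
```

With the numbers on the slide, stalling for 100K cycles out of every 10 million bounds the reconfiguration overhead at 1% of execution time.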
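Cache Organization
[Figure: SRAM bank (tag array, data array, adaptive arrays) with the DRAM bank stacked above, shown at low and high access pressure. Total capacity: 1 MB at low pressure, 9 MB at high pressure. Adaptive arrays become tag arrays for ways in DRAM. Way counts labeled in the figure: 4, 2, 0, 32.]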
Cache Organization
• SRAM banks have three memory arrays: tag array, data array, adaptive array (can act as both tag & data)
• Whenever DRAM banks are switched on, tags are implemented in part of the SRAM
– Quick lookup of tag
• Increased capacity manifests as additional ways
– Cache lines in SRAM need not be flushed on reconfiguration
– Two ways of data available with low latency; moving MRU data to these ways will further increase efficiency
Why is this better than a L2/L3 hierarchy?
• Additional access penalty on L2 miss before the L3 is accessed to service the request
– In our scheme, we look up all tags in parallel, in the SRAM
• An additional level implies additional coherence complexity
• Our experiments show non-trivial performance degradation when SRAM/DRAM is implemented as L2/L3 instead of our scheme
Outline
• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions
Technique II - Page Coloring
• OS can control what Physical Page Number is assigned to each virtual page, thus controlling the index
• It can be manipulated to redirect cache line placements
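[Figure: one address, two views. Cache view: Tag | Index | Offset. Physical address: Physical Page Number | Offset. Page color: the address bits where the cache index overlaps the physical page number.]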
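A minimal sketch of how the OS steers placement through the page color. It assumes the color is simply the low bits of the physical page number and that one color corresponds to one of 16 banks; the slides do not give the exact bit layout, and `page_color` and `pick_frame` are illustrative names, not part of any real OS interface.

```python
PAGE_SIZE = 4096     # 4 KB pages, as in the Methodology slide
NUM_COLORS = 16      # assumption: one color per cache bank

PAGE_OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 offset bits for 4 KB pages


def page_color(physical_address: int) -> int:
    """Color = low bits of the physical page number (assumed layout)."""
    ppn = physical_address >> PAGE_OFFSET_BITS
    return ppn & (NUM_COLORS - 1)


def pick_frame(free_frames, wanted_color):
    """Return the first free physical page number with the wanted color,
    or None if no such frame is free (hypothetical allocator hook)."""
    for ppn in free_frames:
        if ppn & (NUM_COLORS - 1) == wanted_color:
            return ppn
    return None
```

Because the color bits lie above the page offset, every cache line of a page shares the page's color, so choosing the physical frame is enough to choose where the whole page lands in the cache.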
Page Coloring
• Page coloring employed to map data to banks based on proximity to cores.
• We assume an offline oracle page-coloring implementation
• Policies depend upon 2 criteria:
– Knowledge of a page being private or shared
– Knowledge of a page being data or code
• More capacity pressure on banks carrying shared data
Proposed Page Coloring Schemes
[Figure: three 16-bank layouts, one per scheme. Legend: Shared Data + Code, Private Page, Shared Data, Private Code.]
• Share4:D+I - Shared data & code mapped to central 4 banks
• Rp:I+Share4:D - Shared data to central 4 banks; code replicated
• Share16:D+I - Shared data + code distributed to all 16 banks
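The three schemes can be read as a mapping from page type to candidate banks. The sketch below is illustrative only: it assumes bank i sits directly above core i, that private pages always go to the requesting core's own bank, and that "code replicated" means each core keeps its own copy in that bank. The central bank numbers (5, 6, 9, 10) are the shared banks named in the evaluation slides; the function name is hypothetical.

```python
CENTRAL_BANKS = [5, 6, 9, 10]     # shared banks named in the evaluation slides
ALL_BANKS = list(range(16))


def candidate_banks(scheme: str, core: int, shared: bool, is_code: bool):
    """Banks a page may be colored to under the given scheme."""
    if not shared:
        return [core]                   # assumption: private page stays above its core
    if scheme == "Share4:D+I":
        return CENTRAL_BANKS            # shared data and code in the central 4 banks
    if scheme == "Rp:I+Share4:D":
        # shared code is replicated near each core; shared data goes central
        return [core] if is_code else CENTRAL_BANKS
    if scheme == "Share16:D+I":
        return ALL_BANKS                # shared data and code spread over all banks
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, under Rp:I+Share4:D a shared code page touched by core 3 is placed in bank 3, while a shared data page touched by the same core goes to one of the four central banks.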
Outline
• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions
Technique III - Interconnection network
[Figure: on-chip tree network (links and routers). Annotation: routers saved!]
• Predictable traffic pattern
• Data moves between shared central banks/private overhead banks and the core
• Decreased router overhead
• Saves energy and time
Synergy between proposals
[Figure: three mutually reinforcing proposals: page coloring, tree network, hybrid 3D cache. Annotations:]
- No search (S-NUCA)
- Radiating traffic pattern
- No spills into neighboring banks
- Increased bank capacity with low latency
Outline
• Overview of 3D Technology
• Technique I - Reconfigurable Cache Hierarchy
• Technique II - Page coloring
• Technique III - On-chip Interconnection Network
• Evaluation
• Conclusions
Methodology
• Intel ManySim trace-based simulator
• CACTI cache model for area, power and access latencies
• HotSpot 4.0 for thermal evaluation
• 16 cores, 32nm process, 4GHz clock
• 4KB page granularity
• 1MB SRAM bank and 8MB DRAM bank
• SAP, SPECjbb, TPC-C and TPC-E commercial multi-threaded workload traces
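The capacity figures implied by these parameters can be worked out with a short calculation. This is a sketch for orientation only: the bank sizes, bank count and page size come from the slide, while the helper names and the idea of counting whole pages per bank are illustrative assumptions.

```python
KB = 1024
MB = 1024 * KB

NUM_BANKS = 16          # one SRAM bank (and one DRAM bank above it) per core
PAGE_SIZE = 4 * KB      # 4 KB page granularity
SRAM_BANK = 1 * MB
DRAM_BANK = 8 * MB


def bank_capacity(dram_enabled: bool) -> int:
    """Capacity of one bank position, in bytes."""
    return SRAM_BANK + (DRAM_BANK if dram_enabled else 0)


def pages_per_bank(dram_enabled: bool) -> int:
    """How many 4 KB pages fit in one bank position."""
    return bank_capacity(dram_enabled) // PAGE_SIZE


def total_capacity(enabled_dram_banks: int) -> int:
    """Total cache capacity with the given number of DRAM banks switched on."""
    return NUM_BANKS * SRAM_BANK + enabled_dram_banks * DRAM_BANK
```

Under these numbers a bank position holds 256 pages with DRAM off and 2304 pages with DRAM on, and the whole cache ranges from 16 MB (all DRAM off) to 144 MB (all DRAM on).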
Workload Characterization
[Figure: Sharing Characterization. Bar chart of percentage sharing (0 to 100) for CODE and DATA pages; x-axis workloads: SAP, SPECJBB, TPCC, TPCE, AVG.]
• Working set size of code pages is 0.6% of data pages
• Average code page access count is 57%
Page Coloring Evaluation
[Figure: bar chart comparing the page coloring schemes; y-axis from 0 to 2.]
Capacity constraint favors distributing shared pages
Code Replication favorable when capacity is available
Interconnect Evaluation
[Figure: stacked bar chart of percentage bank access (0% to 100%), split into Local, Sibling and Distant, for BASE, Share4:D+I, Share4:D and Share16:D+I.]
Network power savings up to 48%
Most accesses are local due to code replication
Most accesses are random
Hybrid Cache Evaluation
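[Figure: bar chart (y-axis from 0 to 1.8) comparing four configurations: Base-No-PC (cores + SRAM as L2), Base-2x-No-PC (cores + two SRAM layers, both L2), Base-3-level (cores + SRAM as L2 + DRAM as L3), and the proposed chip (cores + SRAM and DRAM, both L2).]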
Re-configurable Cache (with code replication) performs 55% better than Base-1
~ 5% IPC drop, to get power savings
SRAM-DRAM Hits without Reconfiguration
[Figure: stacked bar chart of the percentage of hits (0% to 100%) served by SRAM and by DRAM for each cache bank, 0 to 15.]
Most accesses are to SRAM ways except in shared banks (5,6,9,10)
SRAM-DRAM Hits with Reconfiguration
[Figure: percentage of hits (0% to 100%) for each cache bank, 0 to 15, with reconfiguration.]
Reconfiguration Policy
[Figure: chart with y-axis from 0% to 100% and x-axis bank number, 0 to 15; series: SAP, TPCC, TPCE, SPEC.]
Shared Banks have DRAM always enabled
SPECJbb – DRAM always enabled – majority pages are private
Related Work
• Reconfigurable Caches in 2D
– Ranganathan et al. (ISCA ‘00), Balasubramonian et al. (MICRO ‘00), Zhang et al. (ISCA ‘03)
• 3D Cache hierarchy
– Lie et al. (IEEE D&T ‘05), Loi et al. (DAC ‘06), Kgil et al. (ASPLOS ‘06), Loh (ISCA ‘08)
• Page coloring for NUCA
– Cho et al. (MICRO ‘06), Awasthi et al. (HPCA ‘09), Chaudhuri (HPCA ‘09)
• 3D NUCA interconnect
– Li et al. (ISCA ‘06)
• Ours is the first paper to propose an SRAM/DRAM cache and a targeted tree network, and to combine all of these into a 3D hierarchy
Key Contributions
• A synergistic cache design
• Communication- and capacity-optimized 3D cache
– Reconfigurable cache to improve performance while reducing power
– OS-based page coloring for reduced communication
– Tailor-made on-chip network for quicker communication
• Significant increase in efficiency
– Performance improvement of up to 62%
– Network power savings of up to 48%
• Typical thermal effect +7 Celsius
Thank you.
• Questions?