introductionintroduction methodologymethodology level based hybrid cache architecturelevel based...
DESCRIPTION
3/18TRANSCRIPT
Hybrid Cache Architecture with Disparate Memory Technologies
Xiaoxia Wu† Jian Li‡ Lixin Zhang‡ Evan Speight‡ Ram Rajamony‡ Yuan Xie†† Pennsylvania State University
‡ IBM Austin Research Laboratory
Agenda• Introduction• Methodology• Level based Hybrid Cache Architecture• Region based Hybrid Cache Architecture• 3D Hybrid Cache Stacking• Conclusion
2/18
Introduction (1/3)• Traditional SRAM-based cache architecture
– Limited size with CMP: cache-core balance– Leakage power– More cache levels: Design overhead, coherence– Non-uniform Cache Architecture (wire delay)
• Improve cache power-performance with Emerging Memory Technologies, under the same chip area/footprint– Embedded DRAM– Magnetic RAM– Phase Change RAM– Three-dimensional space
3/18
Introduction (2/3)• Different Memory Technologies
4/23
SRAM(6T)
DRAM(1T 1C)
MRAM(1T 1J)
PRAM(1T 1J)
Density (ratio) Low (1) High (4) High (4) High (16)
Dynamic Power Low Medium Low for readHigh for write
Medium for read
High for write
Leakage Power High Medium Low Low
Speed Very fast Fast Fast for readSlow for write
Slow for readSlowest for
write
Non-volatility No No Yes Yes
Scalability Yes Yes Yes Yes
Endurance1610 1610 1510 810
Introduction (3/3)• Motivation
5/18
L2 Cache
Methodology (1/2)
Core / L1
L2(SRAM)
L3(SRAM)
L3(SRAM)
L2(SRAM)
Core / L1
Core / L1
L2(SRAM)
L3(SRAM)
L3(SRAM)
L2(SRAM)
Core / L1
Core / L1
L2(SRAM)
L3(SRAM)
L3(SRAM)
L2(SRAM)
Core / L1
Core / L1
L2(SRAM)
L3(SRAM)
L3(SRAM)
L2(SRAM)
Core / L1
Core / L1L2
(SRAM)
L3(SRAM)
Core / L1L2
(SRAM)L3
(eDRAMMRAMPRAM)
Core / L1L2
(SRAM)L2
(eDRAMMRAMPRAM)
L3 PRAM
L2 (SRAM)L2 (eDRAMMRAM)
Core / L1
L2 PRAM
L2 (SRAM)L2 (eDRAMMRAM)
Core / L1
L4 PRAM
L2 (SRAM)L2 (eDRAM
MRAM)
Core / L1
LHCA RHCA
3DHCA
(A) (B)
(C) (D) (E)
6/18
Methodology (2/2)
Cache Density Latency(cycle)
DynamicEnergy (nJ)
StaticPower
(W)
SRAM(1M) 1 8 0.388 1.36
eDRAM(4M) 4 24 0.72 0.4
MRAM(4M) 4 Read:20Write:60
Read: 0.4Write: 2.3 0.15
PRAM(16M) 16 Read:40Write:200
Read: 0.8Write: 1.5 0.3
Item Setting value
Processor 8-way issue out-of-order, 8-core, 4GhzL1 32KB DL1, 32KB IL1, 128B, 4-way, 1 R/W port
L2/L3/L4 eDRAM, MRAM, PRAM & 3D StackingMemory 400 cycles latency
• Benchmark: SpecInt06, Specjbb, NAS, Bioperf, Parsec
• Simulator: SystemSim full system simulatorBase line: 256KB (L2) + 1MB(L3)7/18
LHCA (Level based Hybrid Cache Architec-ture)
8/18
RHCA (Region based Hybrid Cache Architecture, 1/7)• Mutually exclusive regions• Parallel search unified LRU• Fast and slow regions in on cache level• Intra-cache data movement policy
– Move frequently used data to the fast region• Drowsy* RHCA
– Keep slow region in drowsy mode– The drowsy mode can be power-gating the non-volatile
memory cells and/or corresponding peripheral CMOS logic.
– It will be used the primitive drowsy mode for the DRAM.
Drowsy Mode: 캐시의 일정 성능을 유지하는 범위에서 소자에 전달되는 전력을 조절하여 전력 소모를 최적화 하는 방법 [15]9/18
RHCA (Region based Hybrid Cache Architecture, 2/7)• Intra-cache data movement policy
– On a cache hit, if the corresponding cache line resides in the fast region, its sticky bit is always set.
10/18
RHCA (Region based Hybrid Cache Architecture, 3/7)• Structure for swap operation.
11/18
RHCA (Region based Hybrid Cache Architecture, 4/7)
RHCA (fast+slow) Fast region L2 total size (latency)
SRAM+eDRAM 256KB (6 cycles) 4MB (24 cycles)
SRAM+MRAM 256KB (6 cycles) 4MB (r: 20, w: 60)
SRAM+PRAM 256KB (6 cycles) 16MB (r: 40, w: 200)
• Slow region: 256KB/bank, 1 r/w port, block size 128B, associativity:16, 16, 64
• RHCA is 256KB less size than corresponding LHCA– Avoid odd-sized cache
• DNUCA policy: more fine grained, move a line to a closer bank to CPU on each hit, bank-based, same size(Dynamic Non-Uniform Cache Architectures)
12/18
RHCA (Region based Hybrid Cache Architecture, 5/7)
eDRAM
MRAM
PRAM
Hit ratio
13/18
SRAM-eDRAM
RHCA (Region based Hybrid Cache Architecture, 6/7)• Multi-core
• Wake-up latency
14/17
SRAM-eDRAM
RHCA (Region based Hybrid Cache Architecture, 7/7)• Threshold
• Replacement and insertion policy
15/17
Baseline: LRU
3D Hybrid Cache Stacking
L3 PRAM
L2 (SRAM)L2 (eDRAMMRAM)
Core / L1
L2 PRAM
L2 (SRAM)L2 (eDRAMMRAM)
Core / L1
L4 PRAM
L2 (SRAM)L2 (eDRAM
MRAM)
Core / L1
(C) (D) (E)
• 3DHCA-C (3D LHCA): 256KB L2 SRAM, 4M L3 eDRAM, 32M L4 PRAM
• 3DHCA-D: 32M L2 fast, middle, slow region (3D RHCA)– Data in slow region can be moved to fast and middle re-
gions• 3DHCA-E: 4M L2 fast+slow region, 32M L3 PRAM
(LHCA+RHCA) 16/18
3D Hybrid Cache Stacking
17/18
Conclusion• Hybrid cache architecture is promising to improve
cache power-performance under same chip area/footprint
• RHCA and LHCA achieve better power-perfor-mance than SRAM-based design
• RHCA outperforms LHCA with minimal hardware support
• 3DHCA achieves better performance than LHCA and RHCA, while still maintains lower power than 2D SRAM baseline
18/18