
Hybrid Cache Architecture with Disparate Memory Technologies

Xiaoxia Wu†, Jian Li‡, Lixin Zhang‡, Evan Speight‡, Ram Rajamony‡, Yuan Xie†

† Pennsylvania State University
‡ IBM Austin Research Laboratory

Agenda
• Introduction
• Methodology
• Level based Hybrid Cache Architecture
• Region based Hybrid Cache Architecture
• 3D Hybrid Cache Stacking
• Conclusion

Introduction (1/3)
• Traditional SRAM-based cache architecture
  – Limited size with CMP: cache-core balance
  – Leakage power
  – More cache levels: design overhead, coherence
  – Non-uniform Cache Architecture (wire delay)
• Improve cache power-performance with emerging memory technologies, under the same chip area/footprint
  – Embedded DRAM
  – Magnetic RAM
  – Phase Change RAM
  – Three-dimensional space

Introduction (2/3)
• Different Memory Technologies

                  SRAM (6T)   DRAM (1T 1C)   MRAM (1T 1J)                     PRAM (1T 1J)
Density (ratio)   Low (1)     High (4)       High (4)                         High (16)
Dynamic power     Low         Medium         Low for read, high for write     Medium for read, high for write
Leakage power     High        Medium         Low                              Low
Speed             Very fast   Fast           Fast for read, slow for write    Slow for read, slowest for write
Non-volatility    No          No             Yes                              Yes
Scalability       Yes         Yes            Yes                              Yes
Endurance         10^16       10^16          10^15                            10^8

Introduction (3/3)
• Motivation

[Figure: motivation chart; only the label "L2 Cache" is recoverable]

Methodology (1/2)

[Figure: cache organizations compared on the 8-core chip.
Baseline: Core / L1, L2 (SRAM), L3 (SRAM).
(A) LHCA: Core / L1, L2 (SRAM), L3 (eDRAM, MRAM, or PRAM).
(B) RHCA: Core / L1, L2 made of an SRAM region plus an eDRAM, MRAM, or PRAM region.
(C), (D), (E) 3DHCA: stacked variants with L2, L3, or L4 PRAM layers above Core / L1.]

Methodology (2/2)

Cache        Density   Latency (cycles)        Dynamic energy (nJ)       Static power (W)
SRAM (1M)    1         8                       0.388                     1.36
eDRAM (4M)   4         24                      0.72                      0.4
MRAM (4M)    4         Read: 20, Write: 60     Read: 0.4, Write: 2.3     0.15
PRAM (16M)   16        Read: 40, Write: 200    Read: 0.8, Write: 1.5     0.3

Item        Setting value
Processor   8-way issue out-of-order, 8-core, 4 GHz
L1          32KB DL1, 32KB IL1, 128B, 4-way, 1 R/W port
L2/L3/L4    eDRAM, MRAM, PRAM & 3D Stacking
Memory      400 cycles latency

• Benchmark: SpecInt06, Specjbb, NAS, Bioperf, Parsec
• Simulator: SystemSim full system simulator
• Baseline: 256KB (L2) + 1MB (L3)
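To make the latency trade-off in the technology table concrete, the following Python sketch estimates the average cost of an access that reaches a last-level cache built from each technology. The latencies, static power, and the 400-cycle memory latency are copied from the tables above; the hit rates and write fraction are hypothetical inputs chosen only for illustration, and the formula is a textbook average-access-time estimate, not the SystemSim model used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tech:
    name: str
    read_cycles: int
    write_cycles: int
    static_power_w: float


# Latency and static power values copied from the technology table.
SRAM = Tech("SRAM (1M)", 8, 8, 1.36)
EDRAM = Tech("eDRAM (4M)", 24, 24, 0.4)
MRAM = Tech("MRAM (4M)", 20, 60, 0.15)
PRAM = Tech("PRAM (16M)", 40, 200, 0.3)
MEMORY_CYCLES = 400  # main memory latency from the settings table


def mean_hit_latency(tech: Tech, write_fraction: float) -> float:
    """Read/write-weighted latency of one access to the cache array."""
    return (1 - write_fraction) * tech.read_cycles + write_fraction * tech.write_cycles


def average_access_cycles(tech: Tech, hit_rate: float, write_fraction: float) -> float:
    """Array latency plus the miss penalty paid on the fraction of misses."""
    return mean_hit_latency(tech, write_fraction) + (1 - hit_rate) * MEMORY_CYCLES


if __name__ == "__main__":
    # Hypothetical hit rates: a denser cache is assumed to hit more often.
    assumed_hit_rate = {SRAM: 0.50, EDRAM: 0.70, MRAM: 0.70, PRAM: 0.85}
    for tech, hit_rate in assumed_hit_rate.items():
        cycles = average_access_cycles(tech, hit_rate, write_fraction=0.25)
        print(f"{tech.name}: {cycles:.1f} cycles on average, {tech.static_power_w} W static")
```

The point of the sketch is the shape of the trade-off: a slower but denser array can still lower the average cost if its extra capacity removes enough 400-cycle memory accesses, while write-heavy traffic penalizes MRAM and PRAM.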

LHCA (Level based Hybrid Cache Architecture)

RHCA (Region based Hybrid Cache Architecture, 1/7)
• Mutually exclusive regions
• Parallel search, unified LRU
• Fast and slow regions in one cache level
• Intra-cache data movement policy
  – Move frequently used data to the fast region
• Drowsy* RHCA
  – Keep the slow region in drowsy mode
  – Drowsy mode can power-gate the non-volatile memory cells and/or the corresponding peripheral CMOS logic
  – For DRAM, the primitive drowsy mode is used

* Drowsy mode: a technique that optimizes power consumption by regulating the power delivered to the cells while keeping cache performance at a set level [15]

RHCA (Region based Hybrid Cache Architecture, 2/7)
• Intra-cache data movement policy
  – On a cache hit, if the corresponding cache line resides in the fast region, its sticky bit is always set.
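The movement policy can be sketched in Python for a single cache set. Only the rule that a fast-region hit sets the sticky bit comes from the slide; the per-line hit counter for the slow region, the `threshold` parameter, the victim choice, and the clearing of sticky bits when no victim is available are assumptions added to make the sketch complete, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    sticky: bool = False  # used while the line sits in the fast region
    hits: int = 0         # assumed saturating hit counter for the slow region


class HybridSet:
    """One cache set split into a fast and a slow region (illustrative only)."""

    def __init__(self, fast_tags, slow_tags, threshold=2, counter_max=3):
        self.fast = [Line(t) for t in fast_tags]
        self.slow = [Line(t) for t in slow_tags]
        self.threshold = threshold
        self.counter_max = counter_max

    def access(self, tag):
        """Return which region served the access: 'fast', 'slow', or 'miss'."""
        for line in self.fast:
            if line.tag == tag:
                line.sticky = True  # rule from the slide
                return "fast"
        for index, line in enumerate(self.slow):
            if line.tag == tag:
                line.hits = min(line.hits + 1, self.counter_max)
                if line.hits >= self.threshold:
                    self._try_swap(index)
                return "slow"
        return "miss"

    def _try_swap(self, slow_index):
        # Assumed victim choice: the first fast line whose sticky bit is clear.
        victim = next((i for i, l in enumerate(self.fast) if not l.sticky), None)
        if victim is None:
            # Assumed second chance: clear the sticky bits and retry on a later hit.
            for line in self.fast:
                line.sticky = False
            return
        promoted, demoted = self.slow[slow_index], self.fast[victim]
        promoted.hits, promoted.sticky = 0, False
        demoted.hits, demoted.sticky = 0, False
        self.fast[victim], self.slow[slow_index] = promoted, demoted
```

For example, with `HybridSet([1, 2], [3, 4])`, two hits on tag 3 reach the assumed threshold and swap it with the non-sticky line 2, so a third access to tag 3 is served from the fast region.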

RHCA (Region based Hybrid Cache Architecture, 3/7)
• Structure for swap operation.

RHCA (Region based Hybrid Cache Architecture, 4/7)

RHCA (fast+slow)   Fast region        L2 total size (latency)
SRAM+eDRAM         256KB (6 cycles)   4MB (24 cycles)
SRAM+MRAM          256KB (6 cycles)   4MB (r: 20, w: 60)
SRAM+PRAM          256KB (6 cycles)   16MB (r: 40, w: 200)

• Slow region: 256KB/bank, 1 r/w port, block size 128B, associativity: 16, 16, 64
• RHCA is 256KB smaller than the corresponding LHCA
  – Avoids an odd-sized cache
• DNUCA (Dynamic Non-Uniform Cache Architecture) policy: finer-grained and bank-based; moves a line to a bank closer to the CPU on each hit; same size
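For contrast with RHCA's region swaps, the DNUCA behaviour described above (promote a line one bank closer to the CPU on every hit) can be sketched as follows. The list-of-lists bank model and the choice of the last entry of the closer bank as the line to demote are assumptions made for illustration; the function name is hypothetical.

```python
def dnuca_access(banks, tag):
    """Look up `tag`; on a hit outside the closest bank, move it one bank closer.

    `banks[0]` is the bank closest to the CPU. Each bank is a list of tags.
    Returns the index of the bank that served the hit, or None on a miss.
    """
    for i, bank in enumerate(banks):
        if tag in bank:
            if i > 0:
                j = bank.index(tag)
                closer = banks[i - 1]
                # Assumed victim: the last entry of the closer bank is demoted.
                bank[j], closer[-1] = closer[-1], tag
            return i
    return None


if __name__ == "__main__":
    banks = [[1, 2], [3, 4], [5, 6]]
    for _ in range(3):
        print(dnuca_access(banks, 5), banks)
```

Repeated hits walk a line toward the CPU one bank at a time, which is what makes the policy finer-grained than moving data between one fast and one slow region.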

RHCA (Region based Hybrid Cache Architecture, 5/7)

[Figure: results for eDRAM, MRAM, and PRAM; hit ratio; label: SRAM-eDRAM]

RHCA (Region based Hybrid Cache Architecture, 6/7)
• Multi-core
• Wake-up latency

[Figure label: SRAM-eDRAM]

RHCA (Region based Hybrid Cache Architecture, 7/7)
• Threshold
• Replacement and insertion policy

Baseline: LRU

3D Hybrid Cache Stacking

[Figure: 3DHCA organizations (C), (D), (E), with L4, L2, or L3 PRAM layers stacked above Core / L1]

• 3DHCA-C (3D LHCA): 256KB L2 SRAM, 4M L3 eDRAM, 32M L4 PRAM
• 3DHCA-D (3D RHCA): 32M L2 with fast, middle, and slow regions
  – Data in the slow region can be moved to the fast and middle regions
• 3DHCA-E (LHCA+RHCA): 4M L2 with fast+slow regions, 32M L3 PRAM

3D Hybrid Cache Stacking


Conclusion
• Hybrid cache architecture is a promising way to improve cache power-performance under the same chip area/footprint
• RHCA and LHCA achieve better power-performance than an SRAM-based design
• RHCA outperforms LHCA with minimal hardware support
• 3DHCA achieves better performance than LHCA and RHCA, while still maintaining lower power than the 2D SRAM baseline
