CMP L2 Cache Management
Presented by: Yang Liu
CPS221 Spring 2008
Based on:
Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z. Chishti, M. Powell, and T. Vijaykumar
ASR: Adaptive Selective Replication for CMP Caches, B. Beckmann, M. Marty, and D. Wood
Outline
Motivation
Related Work (1) – Non-uniform Caches
CMP-NuRAPID
Related Work (2) – Replication Schemes
ASR
Motivation
Two options for L2 caches in CMPs
  Shared: high latency because of wire delay
  Private: more misses because of replication
Need hybrid L2 caches
Keep in mind
  On-chip communication is fast
  On-chip capacity is limited
NUCA
Non-Uniform Cache Architecture
Place frequently-accessed data closest to the core to allow fast access
Couple tag and data placement
Can only place one or two ways in each set close to the processor
NuRAPID
Non-uniform access with Replacement And Placement usIng Distance associativity
Decouple the set-associative way number from data placement
Divide the cache data array into d-groups
Use forward and reverse pointers
  Forward: from tag to data
  Reverse: from data to tag
  One to one?
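The forward and reverse pointers can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `TagEntry`, `Frame`, and `DistanceAssociativeCache` are hypothetical, and the sketch assumes a one-to-one tag/frame mapping, which the slide leaves as an open question.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TagEntry:
    """Tag-array entry; fwd is the forward pointer (d-group, frame index)."""
    address: int
    fwd: Optional[Tuple[int, int]] = None


@dataclass
class Frame:
    """Data-array frame; rev is the reverse pointer back to its tag entry."""
    data: Optional[bytes] = None
    rev: Optional[TagEntry] = None


class DistanceAssociativeCache:
    """Hypothetical model: data frames grouped into d-groups by distance."""

    def __init__(self, num_dgroups: int, frames_per_dgroup: int):
        self.dgroups = [
            [Frame() for _ in range(frames_per_dgroup)]
            for _ in range(num_dgroups)
        ]

    def place(self, tag: TagEntry, d_group: int, index: int, data: bytes):
        """Store data in a frame and link tag and frame in both directions."""
        frame = self.dgroups[d_group][index]
        frame.data = data
        frame.rev = tag
        tag.fwd = (d_group, index)

    def read(self, tag: TagEntry) -> bytes:
        """Follow the forward pointer from the tag to its data frame."""
        d_group, index = tag.fwd
        return self.dgroups[d_group][index].data

    def move(self, tag: TagEntry, new_dgroup: int, new_index: int):
        """Move a block between d-groups; the tag's set and way are untouched."""
        old_group, old_index = tag.fwd
        old = self.dgroups[old_group][old_index]
        data = old.data
        old.data, old.rev = None, None
        self.place(tag, new_dgroup, new_index, data)
```

Because only the pointers change in `move`, a block can migrate to a faster d-group without changing its position in the set-associative tag array, which is the decoupling the slide describes.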
CMP-NuRAPID - Overview
Hybrid
  Private tag
  Shared data organization
Controlled Replication – CR
In-Situ Communication – ISC
Capacity Stealing – CS
CMP-NuRAPID – Structure
Needs carefully chosen d-group preferences
CMP-NuRAPID – Data and Tag Array
Tag arrays snoop on bus to maintain coherence
The data array is accessed through a crossbar
CMP-NuRAPID – Controlled Replication
For read-only sharing
  First use: no copy, save capacity
  Second use: copy, reduce future access latency
  In total, avoid off-chip misses
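A minimal sketch of the first-use/second-use policy, assuming per-core private tags that record which d-group holds the copy a core reads, and assuming d-group `i` is the one closest to core `i`. The class name `ControlledReplication` and the event labels are hypothetical; eviction, writes, and coherence are left out.

```python
class ControlledReplication:
    """Toy model of read-only sharing with controlled replication."""

    def __init__(self, num_cores: int):
        # tags[core][addr] -> d-group holding the copy this core reads.
        # Assumption: d-group i is the d-group closest to core i.
        self.tags = [dict() for _ in range(num_cores)]

    def read(self, core: int, addr: int):
        """Return (d-group that served the read, event label)."""
        my_tags = self.tags[core]
        if addr in my_tags:
            d_group = my_tags[addr]
            if d_group != core:
                # Second use of a remote copy: replicate it into the
                # closest d-group so later reads are fast.
                my_tags[addr] = core
                return d_group, "replicate"
            return d_group, "local-hit"
        # Tag miss: look for an on-chip copy through the other cores' tags.
        for other_tags in self.tags:
            if addr in other_tags:
                # First use: point at the existing copy, make no new copy.
                my_tags[addr] = other_tags[addr]
                return other_tags[addr], "pointer"
        # No on-chip copy at all: fetch from memory into the closest d-group.
        my_tags[addr] = core
        return core, "off-chip-fill"
```

With two cores, a read by core 0 fills from off-chip, the first read by core 1 only installs a pointer, the second read by core 1 replicates the block, and the third is a local hit.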
CMP-NuRAPID – Timing Issues
A read starts before the invalidation and ends after the invalidation
  Mark the tag for the block being read from a farther d-group as busy
A read starts after the invalidation begins and ends before the invalidation completes
  Put an entry in the queue that holds the order of the bus transactions before sending a read request to a farther d-group
CMP-NuRAPID – In-situ Communication
For read-write sharing
Communication (C) state
Write-through for all C blocks in L1 cache
CMP-NuRAPID – Capacity Stealing
Demote less-frequently-used data to unused frames in the d-groups closer to the cores with lower capacity demand
Placement and Promotion
  Place all private blocks in the d-group closest to the initiating core
  Promote the block directly to the closest d-group for the core
CMP-NuRAPID – Capacity Stealing
Demotion and Replacement
  Demote the block to the next-fastest d-group
  Replace in the order of invalid, private, and shared
Doesn’t this kind of demotion pollute another core’s fastest d-group?
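The replacement order (invalid, then private, then shared) can be expressed as a sort key. This is a hedged sketch: the `State` enum and `pick_victim` are hypothetical names, and breaking ties by least-recent use within a category is an assumption that the slides do not state.

```python
from enum import IntEnum


class State(IntEnum):
    """Lower value means the block is replaced first."""
    INVALID = 0
    PRIVATE = 1
    SHARED = 2


def pick_victim(ways):
    """Pick the way to replace.

    ways is a list of (state, last_use_time) tuples, one per way.
    Invalid blocks go first, then private, then shared; within a
    category the least recently used block is chosen (assumed tie-break).
    """
    return min(range(len(ways)), key=lambda i: (ways[i][0], ways[i][1]))
```

For example, with ways `[(SHARED, 1), (PRIVATE, 9), (PRIVATE, 4), (SHARED, 0)]` the victim is way 2, the older of the two private blocks, even though a shared block was used less recently.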
CMP-NuRAPID - Methodology
Simics
4-core CMP
8 MB, 8-way CMP-NuRAPID with 4 single-ported d-groups
Both multithreaded and multiprogrammed workloads
CMP-NuRAPID – Multithreaded
CMP-NuRAPID – Multiprogrammed
Replication Schemes
Cooperative Caching
  Private L2 caches
  Restrict replication under certain criteria
Victim Replication
  Shared L2 cache
  Allow replication under certain criteria
Both have static replication policies
How about dynamic?
ASR - Overview
Adaptive Selective Replication
Dynamic cache block replication
Replicate blocks when the benefits exceed the costs
  Benefits: lower L2 hit latency
  Costs: more L2 misses
ASR – Sharing Types
Single Requestor
  Blocks are accessed by a single processor
Shared Read-Only
  Blocks are read, but not written, by multiple processors
Shared Read-Write
  Blocks are accessed by multiple processors, with at least one write
Focus on replicating shared read-only blocks
  High locality
  Little capacity
  Large portion of requests
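The three sharing types follow directly from which processors touch a block and whether any access is a write. The sketch below classifies a block from its access trace; the function name `classify_sharing` and the `(cpu, is_write)` trace format are assumptions made for illustration.

```python
def classify_sharing(accesses):
    """Classify a block from its access trace.

    accesses is a list of (cpu_id, is_write) pairs for one block.
    """
    cpus = {cpu for cpu, _ in accesses}
    if len(cpus) <= 1:
        return "single-requestor"
    if any(is_write for _, is_write in accesses):
        return "shared-read-write"
    return "shared-read-only"
```

Only blocks that come out as `shared-read-only` are the replication candidates the slide focuses on.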
ASR - SPR
Selective Probabilistic Replication
Assumes private L2 caches and selectively limits replication on L1 evictions
Uses probabilistic filtering to make local replication decisions
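Probabilistic filtering can be sketched as a per-level replication probability that is consulted on each L1 eviction. The probability table below is illustrative only and is not taken from the slides; `LEVEL_PROBABILITY` and `should_replicate` are hypothetical names.

```python
import random

# Illustrative assumption: six replication levels, from "never replicate"
# to "always replicate". The slides do not list the actual values.
LEVEL_PROBABILITY = [0.0, 1 / 64, 1 / 16, 1 / 4, 1 / 2, 1.0]


def should_replicate(level, is_shared_read_only, rng=random):
    """Decide locally whether an evicted L1 block is replicated in the L2.

    Only shared read-only blocks are candidates; each candidate is
    replicated with the probability assigned to the current level.
    """
    if not is_shared_read_only:
        return False
    return rng.random() < LEVEL_PROBABILITY[level]
```

Raising or lowering `level` changes how aggressively the cache replicates without requiring any global coordination per block.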
ASR – Balancing Replication
ASR – Replication Control
Replication levels
  C: Current
  H: Higher
  L: Lower
Cycles
  H: Hit cycles-per-instruction
  M: Miss cycles-per-instruction
ASR – Replication Control
ASR – Replication Control
Wait until there are enough events to ensure a fair cost/benefit comparison
Wait until four consecutive evaluation intervals predict the same change before changing the replication level
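A rough sketch of the control loop, assuming each level (current, higher, lower) comes with estimated hit and miss cycles-per-instruction. The comparison rule, the names `Estimate`, `predicted_change`, and `ReplicationController`, and the level range 0 to 5 are assumptions; the wait for enough events per interval is omitted.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """Estimated hit and miss cycles-per-instruction at one level."""
    hit_cpi: float
    miss_cpi: float


def predicted_change(current, higher, lower):
    """Return +1, -1, or 0 from a cost/benefit comparison."""
    up_benefit = current.hit_cpi - higher.hit_cpi     # hit cycles saved
    up_cost = higher.miss_cpi - current.miss_cpi      # miss cycles added
    down_benefit = current.miss_cpi - lower.miss_cpi  # miss cycles saved
    down_cost = lower.hit_cpi - current.hit_cpi       # hit cycles added
    if up_benefit > up_cost:
        return 1
    if down_benefit > down_cost:
        return -1
    return 0


class ReplicationController:
    """Change level only after several intervals agree on the direction."""

    def __init__(self, level=0, max_level=5, required_streak=4):
        self.level = level
        self.max_level = max_level
        self.required_streak = required_streak
        self.last_change = 0
        self.streak = 0

    def end_interval(self, change):
        """Record one interval's prediction and return the resulting level."""
        if change != 0 and change == self.last_change:
            self.streak += 1
        else:
            self.streak = 1 if change != 0 else 0
        self.last_change = change
        if self.streak >= self.required_streak:
            self.level = min(self.max_level, max(0, self.level + change))
            self.streak = 0
        return self.level
```

Requiring a streak of agreeing intervals keeps the level from oscillating when the estimates are noisy, at the price of reacting more slowly to a real phase change.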
ASR – Designs Supported by SPR
SPR-VR
  Add 1 bit per L2 cache block to identify replicas
  Disallow replication when the local cache set is filled with owner blocks with identified sharers
SPR-NR
  Store a 1-bit counter per remote processor for each L2 block
  Remove the shared bus overhead (How?)
SPR-CC
  Model the centralized tag structure using an idealized distributed tag structure
ASR - Methodology
Two CMP configurations – Current and Future
8 processors
Writeback, write-allocate cache
Both commercial and scientific workloads
Use throughput as the metric
ASR – Memory Cycles
ASR - Speedup
Conclusion
Hybrid is better
Dynamic is better
Need tradeoff
How does it scale…