in-network cache coherence micro’2006 noel eisley et.al, princeton univ. presented by pak, eunji

17
In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Upload: kathryn-floyd

Post on 03-Jan-2016

213 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

In-network cache coher-ence

MICRO’2006Noel Eisley et.al, Princeton Univ.

Presented by PAK, EUNJI

Page 2: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

2

In-Network Cache Coherence

• Traditional Directory based Coherence• Decouple the protocol and the Communication medium

• Protocol optimizations for end-to-end communication• Network optimizations for reduced latency

• Known problems due to technology scaling• Wire delays• Storage overhead of directories (80 core CMPs)

• Expose the interconnection medium to the protocol• Directory = Centralized serializing point

• Distribute the directory with-in the network

Page 3: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

In-Network Read optimization

H A B

Read request

To sheerer dataDirectory based MSI

•Three end-to-end messages

In-Network Cache Coherence, Noel Eisley,

The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

Page 4: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

In-Network Read optimization

H A B

•Node B “bumps” into node A•While message in-transit to the home node H obtain the data directly from A

Read request

data In-Network MSI

In-Network Cache Coherence, Noel Eisley,

The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

Page 5: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

In-Network Write optimization

H A B

write request

Inv

InvDirectory based MSIAck

data

C

Ack

In-Network Cache Coherence, Noel Eisley,

The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

Page 6: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

In-Network Write optimization

H A B

write request

Inv+AckIn-network MSI

data

C

Ack +inv

•This in-transit optimization can reduce write communication from two round-trips to a single round-trip from C to H and back

In-Network Cache Coherence, Noel Eisley,

The 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

Page 7: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Current: Directory-Based Protocol

• Home node (H) keeps a list (or partial list) of sharers

• Read: Round trip over-head– Reads don’t take ad-

vantage of locality• Write: Up to two round

trip overhead– Invalidations ineffi-

cient

H

Page 8: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

New Approach: Virtual Network

• Network is not just for get-ting from A to B

• It can keep track of sharers and maintain coherence

• How? With trees stored within routers– Tree connecting home

node with sharers– Tree stored within

routers– On the way to home

node, when requests “bump” into trees, routers will re-route them towards sharers.

H

Bump into tree

Add to tree

Page 9: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

H

Reads Locate Data Efficiently

LegendH: Home node

Sharer node

Tree node (no data)

Read request/re-plyWrite request/reply

Teardown re-questAcknowledge-ment

• Read Request in-jected into the network

• Tree constructed as read reply is sent back

• New read in-jected into the network

• Request is redi-rected by the net-work

• Data is obtained from sharer and re-turned

Page 10: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Writes Invalidate Data Efficiently

H

• Write Request in-jected into the network

• In-transit invali-dations

• Acknowledge-ments spawned at leaf nodes

• Wait for ac-knowledgements

• Send out write reply

LegendH: Home node

Sharer node

Tree node (no data)

Read request/re-plyWrite request/reply

Teardown re-questAcknowledge-ment

Page 11: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Router Micro-architecture

Tree Cache

Link Tra-versal

Crossbar Traversal

Virtual Channel Arbitration

Physical Channel ArbitrationTree

Cache

Routing

Router Pipeline

Tag Data

En

try

H

Page 12: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Router Micro-architecture: Virtual Tree Cache

• Each cache line in the virtual tree cache contains– 2 bits (To, From root

node) for each di-rection (NSEW)

– 1 bit if the local node contains valid data

– 1 busy bit– 1 Outstanding re-

quest bit

Tag Data

En

try

T T T TF F F FN S E W D B R

Virtual Tree Cache

Page 13: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Methodology

• Benchmarks: SPLASH-2• Multithreaded simulator: Bochs running 16 or

64-way• Network and virtual tree simulator

– Trace-driven • Full-system simulation becomes prohibitive with many

cores• Motivation: explore scalability

– Cycle-accurate– Each message is modeled– Models network contention (arbitration)– Calculates average request completion times

BochsNetwork

Simulator

Applica-tion

Memory traces Memory access

latency perfor-mance

Page 14: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Performance Scalability

• Compare in-network vir-tual tree protocol to standard directory proto-col

• Good improvement– 4x4 Mesh: Avg. 35.5%

read latency reduc-tion, 41.2% write la-tency reduction

• Scalable– 8x8 Mesh: Avg. 35%

read and 48% write latency reduction

4x4 Mesh

8x8 Mesh

Page 15: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Storage Scalability

• Directory protocol: O(# Nodes)• Virtual tree protocol: O(# Ports) ~O(1)• Storage overhead

– 4x4 mesh: 56% more storage bits– 8x8 mesh: 58% fewer storage bits

16

64

# of Nodes Tree Cache Line

Directory Cache Line

28 bits

28 bits

18 bits

66 bits

Page 16: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Issues

• Power Dissipation• Major Bottleneck

• complex functionality in the routers

• Complexity• Complexity of routers, arbiters, etc. • Lack of verification tools

• CAD tool support• More research needed for ease of adoption

Page 17: In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI

Summary

• More and more cores on a CMP• Natural trend towards distribution of state• More interaction between interconnects and protocols

• Ring interconnects => medium size systems• Simple design, low cost• Limited Scalability

• In-Network coherence => large scale systems• Scalability• Severe implementation and power constraints on chip