
Page 1: EECS 570 Lecture 12 Dir. Optimization & COMA

Lecture 12 Slide 1 EECS 570

EECS 570

Lecture 12

Dir. Optimization & COMA

Winter 2020

Prof. Satish Narayanasamy

http://www.eecs.umich.edu/courses/eecs570/

Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk, Reinhardt, Roth, Smith, Singh, and Wenisch.

Lecture 12 Slide 2 EECS 570

Announcements

Midterm Wednesday

5p – 6:30p

EECS 1003 (first name: A-P)
EECS 1005 (first name: Q-Z)

Alternate exam: 1:30-3p (DOW 1010)

No cheat sheet.

Bring calculators.

Lecture 12 Slide 3 EECS 570

Readings

For today:
Gupta et al., "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," 1990.
Fredrik Dahlgren and Josep Torrellas, "Cache-Only Memory Architectures," IEEE Computer 32(6), 1999, pp. 72-79.

Lecture 12 Slide 4 EECS 570

Designing a Directory Protocol: Nomenclature

• Local Node (L): node initiating the transaction we care about

• Home Node (H): node where the directory/main memory for the block lives

• Remote Node (R): any other node that participates in the transaction

Lecture 12 Slide 5 EECS 570

Read Transaction

• L has a cache miss on a load instruction

1: L → H: Get-S
2: H → L: Data
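The two-hop flow above can be sketched as a directory handler. This is a minimal illustration, assuming a simple MSI-style directory; the function name and the dictionary layout are mine, not from any real protocol spec.

```python
# Minimal sketch of the 2-hop read: L sends Get-S, H replies with Data
# and records L as a sharer. All names are illustrative.

def home_handle_get_s(directory, block, requester):
    """Home node: serve a Get-S when no remote node owns the block."""
    entry = directory.setdefault(block, {"state": "I", "sharers": set()})
    assert entry["state"] in ("I", "S"), "owned blocks need a 3-/4-hop flow"
    entry["state"] = "S"
    entry["sharers"].add(requester)
    return ("Data", block)  # step 2: Data reply back to L

directory = {}
reply = home_handle_get_s(directory, "A", "L")  # step 1: Get-S from L
```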

Lecture 12 Slide 6 EECS 570

4-hop Read Transaction

• L has a cache miss on a load instruction; the block was previously in modified state at R

Initial directory state: M, Owner: R
1: L → H: Get-S
2: H → R: Recall
3: R → H: Data
4: H → L: Data

Lecture 12 Slide 7 EECS 570

3-hop Read Transaction

• L has a cache miss on a load instruction; the block was previously in modified state at R

Initial directory state: M, Owner: R
1: L → H: Get-S
2: H → R: Fwd-Get-S
3: R → L: Data (R also sends Data back to H)
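The 4-hop and 3-hop flows differ only in who delivers Data to L. A sketch of the home node's decision, with illustrative function and message names (not from a specific protocol spec):

```python
# Home node's choice between the 4-hop (Recall) and 3-hop (Fwd-Get-S)
# flows when a remote node holds the block in M. Illustrative only.

def home_handle_get_s(entry, requester, three_hop=True):
    """Return the messages H emits for a Get-S from `requester`."""
    if entry["state"] != "M":
        return [("Data", "H", requester)]   # clean at home: plain 2-hop
    owner = entry["owner"]
    if three_hop:
        # 3-hop: owner sends Data straight to L (and a copy back to H)
        return [("Fwd-Get-S", "H", owner)]
    # 4-hop: H recalls the data first, then replies to L itself
    return [("Recall", "H", owner)]

entry = {"state": "M", "owner": "R"}
msgs_3hop = home_handle_get_s(entry, "L")
msgs_4hop = home_handle_get_s(entry, "L", three_hop=False)
```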

Lecture 12 Slide 8 EECS 570

An Example Race: Writeback & Read

• L has dirty copy, wants to write back to H

• R concurrently sends a read to H

Initial directory state: M, Owner: L

1: L → H: Put-M + Data (L enters MI^A)
2: R → H: Get-S
3: H → L: Fwd-Get-S (directory enters S^D, Sharers: L,R)
4: Race! Put-M & Fwd-Get-S cross; L enters SI^A
5: L → R: Data
6: Put-M arrives at H: race! Final directory state: S
7: H → L: Put-Ack

To make your head really hurt: the SI^A state and the Put-Ack can be optimized away! L and H each know the race happened, so no more messages are needed.
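The race above forces a branch in L's coherence FSM: in MI^A, the outcome depends on whether the Fwd-Get-S or the Put-Ack arrives first. A minimal sketch, with state and message names taken from the slide (the function itself is illustrative):

```python
# L's branch while in MI^A (Put-M sent, awaiting Put-Ack).

def local_in_MIA(msg):
    """Return (next_state, outgoing messages) for L in MI^A."""
    if msg == "Fwd-Get-S":
        # Race: the forward beat our Put-M. Supply the reader, then
        # keep waiting for the Put-Ack in SI^A.
        return ("SI^A", ["Data to R"])
    if msg == "Put-Ack":
        return ("I", [])  # no race: writeback completed normally
    raise ValueError(f"unexpected message in MI^A: {msg}")

race_path = local_in_MIA("Fwd-Get-S")
normal_path = local_in_MIA("Put-Ack")
```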

Lecture 12 Slide 9 EECS 570

Store-Store Race

• Line is invalid, both L and R race to obtain write permission

Both L and R start in IM^AD

1: L → H: Get-M
2: R → H: Get-M
3: H grants ownership to L (State: M, Owner: L)
4: H → L: Data [ack=0]
5: H processes R's Get-M (Fwd-Get-M to L; New Owner: R)
6: H → L: Fwd-Get-M
7: Race! The Fwd-Get-M can reach L before the Data: L stalls for Data, does 1 store, then Fwds to R
8: L → R: Data [ack=0]
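The stall at L can be sketched as a small event loop: a Fwd-Get-M that arrives while L is still in IM^AD is buffered until the Data arrives, then L performs its one store and forwards the block. The function and event encoding are illustrative.

```python
# L's stall in IM^AD during the store-store race. Illustrative only.

def local_store_race(events):
    """Process messages at L in IM^AD; return the actions L performs."""
    have_data, pending, log = False, [], []
    for ev in events:
        if ev == "Data":
            have_data = True
            log.append("store")                       # exactly one store
            log += [f"Data to {r}" for r in pending]  # release stalled fwds
            pending.clear()
        elif ev == "Fwd-Get-M":
            if have_data:
                log.append("Data to R")
            else:
                pending.append("R")                   # stall until Data
    return log

early_fwd = local_store_race(["Fwd-Get-M", "Data"])  # forward beats Data
late_fwd = local_store_race(["Data", "Fwd-Get-M"])   # no race
```

Either delivery order yields the same observable behavior: one store at L, then Data forwarded to R.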

Lecture 12 Slide 10 EECS 570

Worst-case scenario?

• L evicts dirty copy, R concurrently seeks write permission

Initial directory state: M, Owner: L

1: L → H: Put-M (L enters MI^A)
2: R → H: Get-M
3: H → L: Fwd-Get-M
4: L → R: Data [ack=0]; race! L's Put-M is still floating around, so L enters II^A and waits to ensure the Put-M is gone
5: Put-M from non-owner arrives at H: race!
6: H → L: Put-Ack

Lecture 12 Slide 11 EECS 570

Design Principles

• Think of sending and receiving messages as separate events

• At each "step", consider what new requests can occur (e.g., can a new writeback overtake an older one?)

• Two messages traversing the same direction implies a race
Need to consider both delivery orders
Usually results in a "branch" in the coherence FSM to handle both orderings

• Need to make sure messages can't stick around "lost"
Every request needs an ack; extra states to clean up messages

• Often, only one node knows how a race resolves
Might need to send messages to tell others what to do

Lecture 12 Slide 12 EECS 570

CC Protocol Scorecard

• Does the protocol use negative acknowledgments (retries)?

• Is the number of active messages (sent but unprocessed) for one transaction bounded?

• Does the protocol require clean eviction notifications?

• How/when is the directory accessed during a transaction?

• How many lanes are needed to avoid deadlocks?

Lecture 12 Slide 13 EECS 570

NACKs in a CC Protocol

• Issues: livelock, starvation, fairness

• NACKs as a flow-control method ("home node is busy"): a really bad idea…

• NACKs as a consequence of protocol interaction…

Initial directory state: M, Owner: L

1: L → H: Put-M
2: R → H: Get-S
3: H → L: Fwd-Get-S
4: Race! Put-M & Fwd-Get-S cross at L
5: The Get-S is NACKed (R must retry)
6: Put-M arrives at H: race! Final state: S; no need to Ack

Lecture 12 Slide 14 EECS 570

Bounded # Msgs / Transaction

• Scalability issue: how much queue space is needed

• Coarse-vector vs. cruise-missile invalidation

[Figure: two invalidation schemes after L's ReadEx Req (1). Coarse-vector: H sends Invalidate Req (2) to all sharers in parallel and collects Invalidate Acks (3). Cruise-missile: a single Invalidate Req (2) visits the sharers in sequence before the final Invalidate Ack (3).]
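A coarse vector bounds directory state by letting one bit stand for a group of nodes, at the cost of over-invalidation. A minimal sketch of the fan-out this implies; the parameters and function name are illustrative.

```python
# Coarse-vector invalidation: with n_bits for n_nodes, one bit stands
# for n_nodes // n_bits nodes, so a single sharer can cost a whole
# group of invalidations. Illustrative only.

def coarse_vector_targets(sharers, n_nodes, n_bits):
    """Nodes that receive an invalidation under a coarse vector."""
    group = n_nodes // n_bits
    set_bits = {s // group for s in sharers}  # one bit per sharer group
    return {n for n in range(n_nodes) if n // group in set_bits}

# 64 nodes, 8-bit vector: one true sharer triggers 8 invalidations,
# but directory state stays fixed-size regardless of sharer count.
targets = coarse_vector_targets({5}, n_nodes=64, n_bits=8)
```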

Lecture 12 Slide 15 EECS 570

Frequency of Directory Updates

• How to deal with transient states?
Keep it in the directory: unlimited concurrency
Keep it in a pending-transaction buffer (e.g., a transaction state register file): faster, but limits pending transactions

• Occupancy-free: upon receiving an unsolicited request, can the directory determine the final state solely from the current state?

Lecture 12 Slide 16 EECS 570

Required # of lanes

• Need at least 2:

• More may be needed by I/O, complex forwarding

• How to assign a lane to a message type?
Secondary (forced) requests must not be blocked by new requests
Replies (completing a pending transaction) must not be blocked by new requests

1: Get-M (L → H)
2: Inv (H → R)
3: Inv-Ack (from R)
4: Data (to L)

Lecture 12 Slide 17 EECS 570

Some more guidelines

• All messages should be ack’d (requests elicit replies)

• Maximum number of potential concurrent messages for one transaction should be small and constant (i.e., independent of number of nodes in system)

• Anticipate the "ships passing in the night" effect

• Use context information to avoid NACKs

Lecture 12 Slide 18 EECS 570

Optimizing coherence protocols

Read A (miss)

L H R

Read latency

Lecture 12 Slide 19 EECS 570

Prefetching

Prefetch A

L H R

Read latency

Read A (miss)

Lecture 12 Slide 20 EECS 570

3-hop reads

Read A (miss)

L H R

Read latency

Lecture 12 Slide 21 EECS 570

3-hop writes

Store A (miss)

L H R

Store latency

R

Lecture 12 Slide 22 EECS 570

Migratory Sharing

Node 1: Read X, Write X → Node 2: Read X, Write X → Node 3: Read X, Write X

• Each Read/Write pair results in read miss + upgrade miss

• Coherence FSM can detect this pattern
Detect via back-to-back read-upgrade sequences
Transition to a "migratory M" state
Upon a read, invalidate the current copy and pass the block in a "mig E" state
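The detection step above can be sketched as a simple trace classifier: flag a line migratory once consecutive read-then-write pairs come from different nodes. The function and trace encoding are illustrative, not a real protocol's detector.

```python
# Migratory-pattern detection sketch: back-to-back (read, write) pairs
# by different nodes are the migratory signature. Illustrative only.

def is_migratory(trace):
    """trace: list of (node, 'R'|'W') accesses to one line."""
    pairs, pending_read = [], {}
    for node, op in trace:
        if op == "R":
            pending_read[node] = True
        elif op == "W" and pending_read.get(node):
            pairs.append(node)          # read miss followed by upgrade
            pending_read[node] = False
    return len(pairs) >= 2 and pairs[-1] != pairs[-2]

migratory = is_migratory([(1, "R"), (1, "W"), (2, "R"), (2, "W")])
not_migratory = is_migratory([(1, "R"), (1, "W")])
```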

Lecture 12 Slide 23 EECS 570

Producer Consumer Sharing

Node 1: Read X, Write X; then Node 2: Read X and Node 3: Read X (the pattern repeats)

• Upon a read miss, downgrade instead of invalidate
Detect because there are 2+ readers between writes

• More sophisticated optimizations
Keep track of prior readers
Forward data to all readers upon downgrade

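The producer-consumer heuristic can be sketched as a small piece of directory policy: once two distinct readers appear between writes, a read miss downgrades the owner rather than invalidating it. Names and the entry layout are illustrative.

```python
# Producer-consumer heuristic at the directory. Illustrative only.

def on_read_miss(entry, reader):
    """Record the reader and pick the action for the current owner."""
    entry["readers_since_write"].add(reader)
    if len(entry["readers_since_write"]) >= 2:
        entry["policy"] = "downgrade"   # owner keeps a shared copy
    else:
        entry["policy"] = "invalidate"  # default behavior
    return entry["policy"]

entry = {"readers_since_write": set(), "policy": None}
first = on_read_miss(entry, 2)    # only one reader so far
second = on_read_miss(entry, 3)   # second distinct reader seen
```

A write would clear `readers_since_write`, restarting the detection window.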

Lecture 12 Slide 24 EECS 570

Shortcomings of Protocol Optimizations

• Optimizations built directly into the coherence state machine
Complex! Adds more transitions, races
Hard to verify even basic protocols
Each optimization contributes to state explosion
Can target only simple sharing patterns
Can learn only one pattern per address at a time

Lecture 12 Slide 25 EECS 570

Cache Only Memory Architecture (COMA)

Lecture 12 Slide 26 EECS 570

Big Picture

• Centralized shared memory: uniform access

• Distributed shared memory: non-uniform access latency

• COMA: no notion of "home" node; data moves to wherever it is needed; individual memories behave like caches

Lecture 12 Slide 27 EECS 570

Cache Only Memory Architecture (COMA)

• Make all memory available for migration/replication

• All memory is a DRAM cache, called attraction memory

• Example systems:
Data Diffusion Machine
KSR-1 (hierarchical snooping via ring interconnects)
Flat COMA (fixed home node for directory, but not data)

• Key questions:
How to find data?
How to deal with replacements?
Memory overhead
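The replacement question has a hard constraint: since memory is only a cache, the last copy of a block in the system can never simply be discarded. A minimal sketch of that rule, with an illustrative copy-count table (real systems track this via directory or master-copy bits):

```python
# Attraction-memory replacement sketch: a block may be silently dropped
# only if another copy exists; the last copy must be relocated to some
# other node's attraction memory. Illustrative only.

def evict(block, copy_count):
    """Evict `block` from one attraction memory; return the action."""
    if copy_count[block] > 1:
        copy_count[block] -= 1
        return "dropped"            # replica: safe to drop
    return "relocated"              # last copy: must move, never discard

copy_count = {"A": 2, "B": 1}
action_a = evict("A", copy_count)   # a replica exists elsewhere
action_b = evict("B", copy_count)   # last copy in the system
```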

Lecture 12 Slide 28 EECS 570

COMA Alternatives

• Flat-COMA
Blocks (data) are free to migrate
Fixed directory location (home node) for a physical address

• Simple-COMA
Allocation managed by the OS and done at page granularity

• Reactive-NUMA
Switches between S-COMA and NUMA with a remote cache on a per-page basis