Post on 19-Jan-2016
Lecture 9 ECE/CSC 506 - Spring 2007 - E. F. Gehringer, based on slides by Yan Solihin
Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations
Lower-Level Protocol Choices
• BusRd observed in M state: what transition to make?
  • Change to S: assume I’ll read again soon
    • good for mostly-read data
    • but what about “migratory” data? Thus:
  • Change to I: assume another processor will write to it (Synapse)
    • I read and write, then you read and write, then X reads and writes...
  • Sequent Symmetry and MIT Alewife use adaptive protocols
MESI (4-state) Invalidation Protocol
• Problem with MSI protocol
  • A Rd, Wr sequence incurs 2 bus transactions
    • even when no one is sharing (e.g., serial program!)
    • BusRd (I → S) followed by BusRdX or BusUpgr (S → M)
  • In general, coherence traffic from serial programs is unacceptable
• Add exclusive state, giving four states:
  • Invalid
  • Modified (dirty)
  • Shared (two or more caches may have copies)
  • Exclusive (only this cache has a clean copy, same value as in memory)
• How to decide I → E or I → S?
  • Need to check whether someone else has a copy
  • “Shared” signal on bus: wired-OR line asserted in response to BusRd
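The saving from the Exclusive state can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `transactions` is hypothetical, and it assumes a single processor reading and then writing a block that no other cache holds, with MSI using BusUpgr for the upgrade.

```python
def transactions(protocol: str) -> list[str]:
    """Bus transactions for Rd then Wr to an uncached, unshared block."""
    bus = ["BusRd"]  # the read miss always needs a BusRd
    # MESI loads into E when the shared line is not asserted; MSI only has S.
    state = "E" if protocol == "MESI" else "S"
    # Write hit: E -> M is silent, S -> M needs BusUpgr (or BusRdX).
    if state == "S":
        bus.append("BusUpgr")
    return bus


print(transactions("MSI"))   # ['BusRd', 'BusUpgr']
print(transactions("MESI"))  # ['BusRd']
```

For a serial program, every first write to a freshly read block follows this pattern, so the second transaction is pure overhead under MSI.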
MESI: Processor-Initiated Transactions
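[State diagram, transcribed as transitions (processor event/bus action); BusRd(~S) means the shared line was not asserted:]
M: PrRd/–, PrWr/– (stay in M)
E: PrRd/– (stay in E); PrWr/– (E → M)
S: PrRd/– (stay in S); PrWr/BusRdX (S → M)
I: PrRd/BusRd(~S) (I → E); PrRd/BusRd(S) (I → S); PrWr/BusRdX (I → M)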
MESI: Bus-Initiated Transactions
[State diagram, transcribed as transitions (bus event/action):]
M: BusRd/Flush (M → S); BusRdX/Flush (M → I)
E: BusRd/Flush (E → S); BusRdX/Flush (E → I)
S: BusRd/Flush' (stay in S); BusRdX/Flush' (S → I)
I: BusRd/–, BusRdX/– (stay in I)
MESI State Transition Diagram
BusRd(S) means shared line asserted on BusRd transaction
[State diagram: states M, E, S, I, combining the processor-initiated transitions (PrRd, PrWr) and the bus-initiated transitions (BusRd, BusRdX, with Flush) in a single diagram.]
Flush vs. Flush'
• Flush: mandatory
• Flush' happens only when
  • cache-to-cache sharing is used, and
  • only one cache flushes the data
MESI Visualization
[Animation: processors P1, P2, P3, each with a cache and a snooper on a shared bus, plus the memory controller and main memory. Steps, with processors as in the example table that follows:]
1. Initially, main memory holds X=1 and no cache holds X.
2. P1: rd &X. BusRd; no other cache has a copy, so P1 loads X=1 in state E.
3. P1: wr &X (X=2). E → M with no bus transaction. One less bus request due to the Exclusive state, esp. for serial programs.
4. P3: rd &X. BusRd; P1 flushes X=2, which also updates memory to X=2. P1: M → S; P3 loads X=2 in state S.
5. P3: wr &X (X=3). BusUpgr; P1: S → I; P3: S → M. Note: BusUpgr instead of BusRdX.
6. P1: rd &X. BusRd; P3 flushes X=3, which also updates memory to X=3. P3: M → S; P1 loads X=3 in state S.
7. P3: rd &X. Hit in state S; no bus transaction.
8. P2: rd &X. BusRd; P2 loads X=3 in state S, supplied by another cache (Flush'). Referred to as cache-to-cache transfer in the Illinois MESI protocol.
MESI Example (Cache-to-Cache Transfer)
Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            S          –          S          BusRd/Flush     P1 cache
W3            I          –          M          BusRdX          Mem
R1            S          –          S          BusRd/Flush     P3 cache
R3            S          –          S          –               Own cache
R2            S          S          S          BusRd/Flush'    P1/P3 cache*

* Data comes from memory if there is no cache-to-cache transfer (bus action BusRd/–)
MESI Example (Cache-to-Cache Transfer+BusUpgr)
Proc Action   State P1   State P2   State P3   Bus Action      Data From
R1            E          –          –          BusRd           Mem
W1            M          –          –          –               Own cache
R3            S          –          S          BusRd/Flush     P1 cache
W3            I          –          M          BusUpgr         Own cache
R1            S          –          S          BusRd/Flush     P3 cache
R3            S          –          S          –               Own cache
R2            S          S          S          BusRd/Flush'    P1/P3 cache*

* Data comes from memory if there is no cache-to-cache transfer (bus action BusRd/–)
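The table above can be reproduced with a small simulator for a single block. This is a sketch under stated assumptions, not the lecture's code: the name `mesi_trace` and the row layout are hypothetical, cache-to-cache transfer and BusUpgr are assumed to be available, and "-" marks a cache that has never held the block (the table's "–").

```python
INVALID = ("-", "I")  # "-": never cached; "I": invalidated


def mesi_trace(ops, nproc=3):
    """Return (op, states, bus action, data source) rows for one block."""
    state = ["-"] * nproc
    rows = []
    for op in ops:
        kind, p = op[0], int(op[1:]) - 1  # "R3" -> read by P3
        holders = [q for q in range(nproc)
                   if q != p and state[q] not in INVALID]
        bus, src = "-", "Own cache"
        if state[p] in INVALID:
            # Miss: BusRd for a read, BusRdX for a write.
            bus, src = ("BusRd" if kind == "R" else "BusRdX"), "Mem"
            if holders:
                # Flush from M or E; Flush' is the optional supply from S.
                only_s = all(state[q] == "S" for q in holders)
                bus += "/Flush'" if only_s else "/Flush"
                src = "/".join(f"P{q + 1}" for q in holders) + " cache"
            if kind == "R":
                for q in holders:
                    state[q] = "S"
                state[p] = "S" if holders else "E"
        elif kind == "W" and state[p] == "S":
            bus = "BusUpgr"  # already have the data, only invalidate others
        if kind == "W":
            for q in holders:
                state[q] = "I"
            state[p] = "M"
        rows.append((op, tuple(state), bus, src))
    return rows


for row in mesi_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
    print(row)
```

Replacing the BusUpgr branch with a BusRdX would give the bus actions of the first example table instead.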
Lower-Level Protocol Choices
• Who supplies data on miss when not in M state: memory or cache?
• Original, Illinois MESI: cache
  • assumes cache is faster than memory (cache-to-cache transfer)
  • Not necessarily true
• Adds complexity
  • How does memory know it should supply data? (must wait for caches)
  • Selection algorithm if multiple caches have valid data
• Valuable for distributed memory
  • May be cheaper to obtain from nearby cache than distant memory
  • Especially when constructed out of SMP nodes (Stanford DASH)
Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations
Dragon Writeback Update Protocol
• Four states
  • Exclusive-clean (E): I and memory have it
  • Shared clean (Sc): I, others, and maybe memory, but I’m not owner
  • Shared modified (Sm): I and others but not memory, and I’m the owner
    • Sm and Sc can coexist in different caches, with at most one Sm
  • Modified or dirty (M): I have it, and no one else does
  • On replacement: Sc can silently drop, Sm has to flush
• No invalid state
  • If in cache, cannot be invalid
  • If not present in cache, can view as being in not-present or invalid state
• New processor events: PrRdMiss, PrWrMiss
  • Introduced to specify actions when block not present in cache
• New bus transaction: BusUpd
  • Broadcasts single word written on bus; updates other relevant caches
Dragon: Processor-Initiated Transactions
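[State diagram, transcribed as transitions (processor event/bus action); (S) means the shared line was asserted, (~S) that it was not:]
Not present: PrRdMiss/BusRd(~S) (→ E); PrRdMiss/BusRd(S) (→ Sc); PrWrMiss/(BusRd(S);BusUpd) (→ Sm); PrWrMiss/BusRd(~S) (→ M)
E: PrRd/– (stay in E); PrWr/– (E → M)
Sc: PrRd/– (stay in Sc); PrWr/BusUpd(S) (Sc → Sm); PrWr/BusUpd(~S) (Sc → M)
Sm: PrRd/– (stay in Sm); PrWr/BusUpd(S) (stay in Sm); PrWr/BusUpd(~S) (Sm → M)
M: PrRd/–, PrWr/– (stay in M)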
Dragon: Bus-Initiated Transactions
[State diagram, transcribed as transitions (bus event/action):]
E: BusRd/– (E → Sc)
Sc: BusRd/–, BusUpd/Update (stay in Sc)
Sm: BusRd/Flush (stay in Sm); BusUpd/Update (Sm → Sc)
M: BusRd/Flush (M → Sm)
Dragon State Transition Diagram
[State diagram: states E, Sc, Sm, M, combining the processor-initiated transitions (PrRd, PrWr, PrRdMiss, PrWrMiss) and the bus-initiated transitions (BusRd, BusUpd) in a single diagram. The overbars that mark "shared line not asserted" were lost in this transcript.]
Dragon Visualization
[Animation: processors P1, P2, P3, each with a cache and a snooper on a shared bus, plus the memory controller and main memory. Steps, with processors as in the example table that follows:]
1. Initially, main memory holds X=1 and no cache holds X.
2. P1: rd &X. BusRd; no other cache has a copy, so P1 loads X=1 in state E.
3. P1: wr &X (X=2). E → M with no bus transaction. One less bus request due to the Exclusive state, esp. for serial programs.
4. P3: rd &X. BusRd; P1 supplies X=2 and goes M → Sm; P3 loads X=2 in state Sc. Memory still holds X=1.
5. P3: wr &X (X=3). BusUpd; P3: Sc → Sm; P1's copy is updated to X=3 and P1 goes Sm → Sc. Note: BusUpd instead of BusUpgr (no invalidation is performed).
6. P1: rd &X. Hit in state Sc. This is a miss in the MESI and MSI protocols.
7. P3: rd &X. Hit in state Sm; no bus transaction.
8. P2: rd &X. BusRd; P2 loads X=3 in state Sc. Note: only the cache in state Sm is responsible for cache-to-cache transfer.
9. P1 replaces X (its copy is in state Sc).
10. P3 replaces X. The owner is responsible for writing back to memory (X=3), vs. MSI or MESI, where write-back happens only when the line is in M state.
Dragon Example
Proc Action   State P1   State P2   State P3   Bus Action     Data From
R1            E          –          –          BusRd          Mem
W1            M          –          –          –              Own cache
R3            Sm         –          Sc         BusRd/Flush    P1 cache
W3            Sc         –          Sm         BusUpd/Upd     Own cache
R1            Sc         –          Sm         –              Own cache
R3            Sc         –          Sm         –              Own cache
R2            Sc         Sc         Sm         BusRd/Flush    P3 cache
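The same trace can be replayed with a small Dragon simulator for a single block. This is a sketch under stated assumptions rather than the lecture's code: the name `dragon_trace` and the row layout are hypothetical, "-" stands for "not present in this cache", and replacements are not modeled.

```python
def dragon_trace(ops, nproc=3):
    """Return (op, states, bus action, data source) rows for one block."""
    state = ["-"] * nproc  # "-" means the block is not present
    rows = []
    for op in ops:
        kind, p = op[0], int(op[1:]) - 1  # "W3" -> write by P3
        holders = [q for q in range(nproc) if q != p and state[q] != "-"]
        # At most one cache owns the block (state M or Sm).
        owner = next((q for q in holders if state[q] in ("M", "Sm")), None)
        bus, src = "-", "Own cache"
        miss = state[p] == "-"
        if miss:
            bus, src = "BusRd", "Mem"
            if owner is not None:
                bus, src = "BusRd/Flush", f"P{owner + 1} cache"
            # Snoopers: the owner ends in Sm, every other holder in Sc.
            for q in holders:
                state[q] = "Sm" if q == owner else "Sc"
            state[p] = "Sc" if holders else "E"
        if kind == "W":
            if state[p] in ("Sc", "Sm"):
                # Possibly shared: broadcast the written word.
                upd = "BusUpd/Upd" if holders else "BusUpd"
                bus = bus + ";" + upd if miss else upd
                for q in holders:
                    state[q] = "Sc"  # ownership moves to the writer
                state[p] = "Sm" if holders else "M"
            else:
                state[p] = "M"  # E or M: silent write
        rows.append((op, tuple(state), bus, src))
    return rows


for row in dragon_trace(["R1", "W1", "R3", "W3", "R1", "R3", "R2"]):
    print(row)
```

Note that the second R1 is a hit here, whereas the MESI trace needed a BusRd at that point because P1's copy had been invalidated.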
Lower-Level Protocol Choices
• Can shared-modified state be eliminated?
  • Yes, if memory is updated as well on BusUpd transactions (DEC Firefly)
  • Dragon protocol doesn’t (assumes DRAM memory slow to update)
• Should replacement of an Sc block be broadcast?
  • Would allow last copy to go to Exclusive state and not generate updates
  • Replacement bus transaction is not in critical path; a later update may be
• Shouldn’t update local copy on write hit before controller gets bus
  • Can mess up serialization
• Coherence, consistency considerations much like write-through case
• In general, many subtle race conditions in protocols
• But first, let’s illustrate quantitative assessment at logical level
Lecture 9 Outline
• MESI protocol
• Dragon update-based protocol
• Impact of protocol optimizations
Assessing Protocol Tradeoffs
• Methodology:
  • Use simulator; choose parameters per earlier methodology (default 1 MB, 4-way cache, 64-byte block, 16 processors; 64 KB cache for some)
  • Focus on frequencies, not end performance, for now
    • transcends architectural details, but not what we’re really after
  • Use idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
    • Cheap simulation: no need to model contention
Impact of Protocol Optimizations
• MSI ≈ MESI: little difference in bus traffic for these workloads
• Upgrades instead of read-exclusive helps
• Same story when working sets don’t fit for Ocean, Radix, Raytrace

[Figure: MESI vs. MSI (w/ BusUpgr) vs. MSI (w/ BusRdX). Two bar charts of traffic (MB/s), each bar split into address-bus and data-bus traffic. Bars are labeled workload/protocol, with protocols III (Illinois MESI), 3St (three-state MSI with BusUpgr), and 3St-RdEx (three-state MSI with BusRdX). One chart covers Barnes, LU, Ocean, Radiosity, Radix, and Raytrace; the other covers Appl-Code, Appl-Data, OS-Code, and OS-Data.]
Impact of Cache-Block Size
• Multiprocessors add a new kind of miss to cold, capacity, conflict
  • Coherence misses: due to invalidations
    • True sharing: write to the same word
    • False sharing: write to a different word in the same block
• Reducing misses architecturally in invalidation protocol
  • Capacity: enlarge cache; increase block size (if spatial locality)
  • Conflict: increase associativity
  • Cold and coherence: only block size
• Increasing block size has advantages and disadvantages
  • Can reduce misses if spatial locality is good
  • Can hurt too
    • increase misses due to false sharing if spatial locality not good
    • increase misses due to conflicts in fixed-size cache
    • increase traffic due to fetching unnecessary data and due to false sharing
    • can increase miss penalty and perhaps hit cost
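The link between block size and false sharing can be made concrete with a little address arithmetic. The sketch below is illustrative only: the function name `classify_sharing`, the 4-byte word size, and the two byte addresses are assumptions, and it classifies a pair of writes by two different processors.

```python
def classify_sharing(addr_a: int, addr_b: int, block_size: int,
                     word_size: int = 4) -> str:
    """Classify writes by two processors to byte addresses addr_a, addr_b."""
    if addr_a // block_size != addr_b // block_size:
        return "no sharing"      # different blocks: no invalidation
    if addr_a // word_size == addr_b // word_size:
        return "true sharing"    # same word
    return "false sharing"       # same block, different words


# Two processors write words 24 bytes apart (hypothetical addresses).
for block in (8, 16, 32, 64):
    print(block, classify_sharing(0x100, 0x118, block))
```

With 8- and 16-byte blocks the two words sit in different blocks, while with 32- and 64-byte blocks they share one, so each write invalidates the other processor's copy even though no data is actually communicated.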
Impact of Block Size on Miss Rate
For default problem size: vary block/line size from 8 to 256 bytes
• Decreases with larger lines: cold, capacity (due to spatial locality), true sharing (due to spatial locality)
• Increases with larger lines: false sharing
• Working set doesn’t fit: impact of capacity misses large (Ocean, Radix)

[Figure: two stacked-bar charts of miss rate (%) versus block size (8, 16, 32, 64, 128, 256 bytes), broken down into cold, capacity, true sharing, false sharing, and upgrade misses. One chart covers Barnes, LU, and Radiosity; the other covers Ocean, Radix, and Raytrace.]
Impact of Block Size on Traffic
• Results different than for miss rate: traffic almost always increases
• When working set fits, overall traffic still small, except for Radix
• Fixed overhead is significant component
  • So total traffic often minimized at 16-32 byte block, not smaller
• Working set doesn’t fit: even 128-byte good for Ocean due to capacity
• Address bus traffic behaves in opposite way to the data bus traffic
• Traffic (bytes/inst) affects performance indirectly through contention
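One way to see why fixed overhead shifts the traffic minimum away from the smallest blocks is a simple model, assumed here rather than taken from the slides: traffic is proportional to m(B) × (h + B), where m(B) is the miss rate at block size B and h is a fixed per-transaction overhead in bytes. Doubling the block then reduces traffic only if m(2B)/m(B) < (h + B)/(h + 2B). The function name and the default of 6 overhead bytes are illustrative assumptions.

```python
def break_even_ratio(block: int, overhead: int = 6) -> float:
    """Miss-rate ratio m(2B)/m(B) below which doubling B reduces traffic.

    Assumes traffic is proportional to miss_rate * (overhead + block).
    """
    return (overhead + block) / (overhead + 2 * block)


for block in (8, 16, 32, 64, 128):
    print(block, round(break_even_ratio(block), 3))
```

At small block sizes the overhead dominates, so a modest drop in miss rate is enough to pay for a larger block; as B grows the ratio approaches 0.5, meaning the miss rate must nearly halve with each doubling for traffic not to rise.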
[Figure: three bar charts of bus traffic versus block size (8, 16, 32, 64, 128, 256 bytes), each bar split into address-bus and data-bus traffic. One chart covers Radix; one covers LU and Ocean; one covers Barnes, Radiosity, and Raytrace. Y-axes are labeled Traffic (bytes/instruction) and Traffic (bytes/FLOP).]