![Page 1: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/1.jpg)
Uncorq:
Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors
http://iacoma.cs.uiuc.edu
Karin Strauss AMD Advanced Architecture and Technology Lab
Xiaowei Shen IBM Research
Josep Torrellas University of Illinois at Urbana-Champaign
![Page 2: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/2.jpg)
Karin Strauss - “Uncorq”
Motivation
• CMPs are ubiquitous
• Shared memory + caches = cache coherence
• Traditional cache coherence solutions
  • directory-based: indirection, storage
  • shared bus-based: electrical, layout issues
![Page 3: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/3.jpg)
Embedded-ring cache coherence [ISCA 2006]
• Novel snoopy cache coherence for mid-sized machines
  • logical ring is embedded in network
  • control messages use ring
  • data messages use any path
Simple and inexpensive to implement
Snoop requests can have long latencies
![Page 4: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/4.jpg)
Contributions
• Propose invariant for transaction serialization
• Propose performance enhancements
  • Uncorq: unconstrained snoop request delivery
    • reduces cache-to-cache transfer latency
  • Simple hardware data prefetching technique
    • reduces memory-to-cache transfer latency
![Page 5: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/5.jpg)
Embedded-ring terminology
• Snoopy, invalidate protocol
• Single supplier protocol
• Types of messages:
  • control messages: snoop request, snoop response, snoop request + response
  • data
[Figure: logical ring with nodes A and B. Legend: request, response, request + response, positive response (+), data. Labels: snoop op. outcome, positive snoop op. outcome.]
![Page 6: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/6.jpg)
Ordering invariant
![Page 7: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/7.jpg)
Transaction serialization
[Figure: timeline for nodes A and B, labeled MSI. Messages: inv, ack, read, data. Line states shown: S, I. Axis: time. Annotations: old value, new value.]
![Page 8: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/8.jpg)
Serialization enforcement with embedded-ring
• Logical unidirectional ring provides partial ordering
• Distributed algorithm establishes global order for same-address transactions
• On simultaneous transactions to same address:
  • one is declared the “winner” (first to reach supplier)
  • others have to retry
[Figure: ring containing node A, with requests and responses traveling around it]
![Page 9: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/9.jpg)
How to serialize transactions
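• No clear “first” transaction
• B’s request reaches S first
• Ring guarantees responses are forwarded in the order S performed snoop operations
• A receives B’s positive response before its own
• A retries, so the transactions serialize as B, then A
[Figure: ring with nodes A, B, and supplier S, showing A’s request and response and B’s request and response; + marks a positive response]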
![Page 10: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/10.jpg)
Enforcing transaction serialization
• What we need to enforce transaction serialization:
  • Node whose request arrives at supplier node first is the “winner”
  • loser node sees other node’s positive response before its own
Ordering Invariant: the order in which responses travel the ring after leaving the supplier must be the same as the order in which the supplier processed their corresponding requests.
[Figure: ring with supplier S; legend: request, response; + marks a positive response]
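The winner/loser rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Response`, `responses_leaving_supplier`, and `decide` are hypothetical, and the sketch assumes that only the first request processed by the supplier gets a positive outcome and that each requester observes the responses in the order they left the supplier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    requester: str  # node whose request this response answers
    positive: bool  # True if the supplier's snoop operation succeeded


def responses_leaving_supplier(arrival_order):
    """Responses leave the supplier in the order it processed the requests.

    Assumption: only the first request processed gets a positive outcome.
    """
    return [Response(node, positive=(i == 0)) for i, node in enumerate(arrival_order)]


def decide(node, stream):
    """Return "win" or "retry" from the order in which responses pass `node`."""
    for resp in stream:
        if resp.requester == node:
            return "win" if resp.positive else "retry"
        if resp.positive:
            return "retry"  # another node's positive response came first
    raise ValueError(f"no response for {node} in stream")


# B's request reaches the supplier S before A's request.
stream = responses_leaving_supplier(["B", "A"])
decisions = {node: decide(node, stream) for node in ("A", "B")}
```

Because the decision depends only on the order of responses, it stays correct as long as the ordering invariant holds, whichever path the requests took.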
![Page 11: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/11.jpg)
Uncorq: Unconstrained snoop request delivery
![Page 12: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/12.jpg)
Uncorq idea
Idea: requests do not have to follow the ring (but responses do)
[Figure: two rings, Baseline and Uncorq; legend: request, response]
![Page 13: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/13.jpg)
Benefit of Uncorq
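Reduced cache-to-cache transfer latency
[Figure: timelines for Baseline and Uncorq showing request, snoop, and data phases along a time axis; the point where the request reaches the supplier node is marked, and the difference between the two timelines is labeled savings]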
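To give a feel for where the savings can come from, the sketch below compares hop counts for a request that follows the ring with one that takes a shortest path. Everything here is an assumption made for illustration: an 8x8 torus, a ring embedded as a row-by-row "snake", and the helper names `ring_order`, `ring_hops`, and `torus_hops`. The actual embedding and routing in the paper may differ.

```python
SIDE = 8  # assumed torus dimension (8 x 8 = 64 nodes)


def ring_order(side=SIDE):
    """Assumed ring embedding: snake through the rows, then wrap around."""
    order = []
    for row in range(side):
        cols = range(side) if row % 2 == 0 else range(side - 1, -1, -1)
        order.extend((row, col) for col in cols)
    return order


def ring_hops(src, dst, order):
    """Hops from src to dst when following the unidirectional ring."""
    return (order.index(dst) - order.index(src)) % len(order)


def torus_hops(src, dst, side=SIDE):
    """Hops from src to dst along a shortest path in the torus."""
    def wrap(a, b):
        d = abs(a - b)
        return min(d, side - d)
    return wrap(src[0], dst[0]) + wrap(src[1], dst[1])


order = ring_order()
requester, supplier = (0, 0), (7, 0)
on_ring = ring_hops(requester, supplier, order)  # request follows the ring
direct = torus_hops(requester, supplier)         # request takes a shortest path
```

In this worst case the supplier is the requester's ring predecessor, so a ring-bound request visits every other node first, while a direct request needs a single wraparound hop.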
![Page 14: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/14.jpg)
Implications of Uncorq
• Uncorq no longer restricts order of requests
• Nodes may receive and process requests in any order
• Responses may also get reordered
Problem: distributed algorithm relies on the fact that response order reflects order of requests at supplier
![Page 15: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/15.jpg)
Example: incorrect transaction ordering
Ordering invariant
A node cannot forward any other response if it has an outstanding positive snoop outcome
[Figure: ring with nodes A, B, and supplier S; legend: request, response; + marks a positive response]
![Page 16: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/16.jpg)
How Uncorq stalls responses
• Local transaction table (per-node structure)
  • records messages that node is currently processing
[Figure: local transaction table with columns addr, requests, and responses; labels A, B, …, C; legend: request, response; + marks a positive response]
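The stalling rule can be sketched with a small per-node table. This is a hedged illustration only: the class `LocalTransactionTable`, its method names, and the choice to track state per address are assumptions, and the sketch ignores cases such as a response arriving before its request has been snooped.

```python
from collections import defaultdict, deque


class LocalTransactionTable:
    """Hypothetical per-node table, keyed by address."""

    def __init__(self):
        self.positive = {}                 # addr -> requester with a positive outcome here
        self.stalled = defaultdict(deque)  # addr -> responses held back, in arrival order
        self.forwarded = []                # (addr, requester, positive) sent down the ring

    def on_request(self, addr, requester, snoop_hit):
        """A request arrived by any path; record a positive snoop outcome."""
        if snoop_hit:
            self.positive[addr] = requester

    def on_response(self, addr, requester, positive=False):
        """A response arrived on the ring; forward it or stall it."""
        owner = self.positive.get(addr)
        if owner is not None and owner != requester:
            # Outstanding positive outcome for another transaction: stall.
            self.stalled[addr].append((addr, requester, positive))
            return
        if owner == requester:
            positive = True  # merge this node's positive outcome into the response
            del self.positive[addr]
        self.forwarded.append((addr, requester, positive))
        while self.stalled[addr]:
            self.forwarded.append(self.stalled[addr].popleft())


# At the supplier: B's request arrives first, but A's response arrives first.
table = LocalTransactionTable()
table.on_request(0x40, "B", snoop_hit=True)
table.on_response(0x40, "A")  # stalled behind B's outstanding positive outcome
stalled_first = list(table.forwarded)
table.on_response(0x40, "B")  # B's positive response goes out, then A's
```

After the last call, B's positive response leaves the node ahead of A's response, which restores the order required by the ordering invariant.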
![Page 17: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/17.jpg)
Optimization: prefetching from memory
• Goal: reduce latency of memory-to-cache transfers
• Predict when no node will supply data
• Access memory in parallel with ring snoop
[Figure: unoptimized: requester R snoops the ring (1), then accesses memory (2); optimized: R snoops the ring and accesses memory at the same time (1)]
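The effect of overlapping the two steps can be shown with a simple latency model. It is a sketch under stated assumptions: the function name and the cycle counts are hypothetical, and it assumes the prefetched data can be used only once the ring snoop has confirmed that no node supplies the line.

```python
def memory_to_cache_latency(ring_snoop, memory_access, prefetch):
    """Latency of a memory-to-cache transfer, in arbitrary time units."""
    if prefetch:
        # Memory is accessed in parallel with the ring snoop.
        return max(ring_snoop, memory_access)
    # Memory is accessed only after the ring snoop completes.
    return ring_snoop + memory_access


# Hypothetical latencies, chosen only for illustration.
unoptimized = memory_to_cache_latency(300, 200, prefetch=False)
optimized = memory_to_cache_latency(300, 200, prefetch=True)
```

In this model the benefit is largest when the ring snoop and the memory access take similar time, since the shorter of the two is hidden entirely.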
![Page 18: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/18.jpg)
Evaluation
![Page 19: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/19.jpg)
Experimental setup
• 64 nodes in a single CMP
• SESC simulator (sesc.sourceforge.net)
• SPLASH-2, SPECjbb and SPECweb workloads
• Interconnection network: 2D torus with embedded-ring
![Page 20: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/20.jpg)
Cache-to-cache transfer latency
Substantial reduction in latency
[Figure: two panels, Uncorq and Baseline. x-axis: cache-to-cache transfer latency (0 to 600). Left y-axis: distribution (%), 0 to 10. Right y-axis: cumulative distribution (%), 0 to 100.]
![Page 21: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/21.jpg)
Execution Time
[Figure: bar chart of normalized execution time (0 to 1) for SPLASH-2, SPECjbb, and SPECweb; bars: Baseline, Uncorq, Uncorq+Pref]
• Uncorq significantly reduces execution time (reduction: 5-23%)
• Uncorq + Pref performs the best (reduction: 13-26%)
![Page 22: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/22.jpg)
Also in the paper
• Serialization mechanism for case with no supplier
• System and node forward progress
• Fences and memory consistency issues
• Characterization of prefetching mechanism
• Comparison against ccHyperTransport
![Page 23: Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649d405503460f94a1a456/html5/thumbnails/23.jpg)
Conclusion
• Propose invariant for transaction serialization
• Propose performance enhancements
  • Uncorq: unconstrained snoop request delivery
  • Simple hardware data prefetching technique
• Reduce execution time by 13-26%