A Coherence Protocol for Optimizing Global Shared Data Accesses
DESCRIPTION
A Coherence Protocol for Optimizing Global Shared Data Accesses. Jeeva Paudel, University of Alberta, Canada; J. Nelson Amaral, University of Alberta, Canada; Olivier Tardieu, IBM T. J. Watson, USA. Shared Variables are Fundamental Abstractions in Parallel and Distributed Programming.
TRANSCRIPT
![Page 1: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/1.jpg)
1
A Coherence Protocol for Optimizing Global Shared Data Accesses
Jeeva Paudel, University of Alberta, Canada
J. Nelson Amaral, University of Alberta, Canada
Olivier Tardieu, IBM T. J. Watson, USA
![Page 2: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/2.jpg)
2
Shared Variables are Fundamental Abstractions in Parallel and Distributed Programming
![Page 3: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/3.jpg)
3
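[Figure: read and write accesses to shared data in three example applications]
Monte Carlo Estimation of PI: Node 1 … Node N
5-Point Stencil Operations: Node 1, Node 2 (stencil weights: 4 at the center, -1 at each of the four neighbors)
Turing Ring Simulation: Node 1, Node 2, Node 3, Node 4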
![Page 4: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/4.jpg)
4
Challenge: Minimize Communication Latency
![Page 5: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/5.jpg)
Communication Optimization Techniques
• Ghost Cell Pattern for Data Sharing (5-point stencil split across Node 1 and Node 2)
• Two-sided Message: data payload + message id
• One-sided Message: data payload + address
• Remote Direct Memory Access (RDMA): network interface, host CPU, memory
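The ghost cell pattern named on this slide can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: a grid is split column-wise between two simulated nodes, and each keeps one extra ghost column that mirrors the neighbor's boundary column. The function names (`split_with_ghosts`, `exchange_ghosts`, `apply_stencil`) and the reading of the figure's values (4 at the center, -1 at the four neighbors) as stencil weights are assumptions made for illustration.

```python
# Hypothetical sketch: two simulated nodes share a 5-point stencil via ghost cells.

def split_with_ghosts(grid):
    """Split a grid column-wise into two halves, each padded with one ghost column."""
    half = len(grid[0]) // 2
    left = [row[:half] + [0.0] for row in grid]    # ghost column on the right
    right = [[0.0] + row[half:] for row in grid]   # ghost column on the left
    return left, right

def exchange_ghosts(left, right):
    """Copy each node's boundary column into the neighbor's ghost column."""
    for lrow, rrow in zip(left, right):
        lrow[-1] = rrow[1]   # right node's first owned column
        rrow[0] = lrow[-2]   # left node's last owned column

def apply_stencil(block, r, c):
    """5-point stencil with weights 4 (center) and -1 (four neighbors)."""
    return (4 * block[r][c] - block[r - 1][c] - block[r + 1][c]
            - block[r][c - 1] - block[r][c + 1])

grid = [[float((r + 1) * (c + 1) ** 2) for c in range(4)] for r in range(3)]
left, right = split_with_ghosts(grid)
exchange_ghosts(left, right)

# The point at row 1, global column 1 is owned by the left node; its right
# neighbor (global column 2) lives on the right node and is read from the ghost.
local_result = apply_stencil(left, 1, 1)
global_result = apply_stencil(grid, 1, 1)
```

After the exchange, the left node computes the stencil from purely local memory and gets the same value as a computation on the undivided grid; without the exchange, the stale ghost value would give a different result.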
![Page 6: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/6.jpg)
Communication Optimization Techniques
Transfer Referencing Task to SV Home: atomic at (p) sv();
![Page 7: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/7.jpg)
Communication Optimization Techniques
Transfer Referencing Task to SV Home: atomic at (p) sv();
Remote Task Creation for SV Access: atomic at (p) async sv();
![Page 8: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/8.jpg)
Write-Once / Read-Mostly: Replication (Node 1 … Node N)
![Page 9: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/9.jpg)
Write-Once / Read-Mostly: Replication (Node 1 … Node N)
Result Object: Collecting Sum Reducer (Node 1 … Node N)
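The collecting sum reducer can be sketched with the Monte Carlo estimation of PI mentioned earlier in the deck. This is an illustrative simulation under stated assumptions, not the authors' implementation: nodes are simulated sequentially, `local_hits` and `estimate_pi` are hypothetical names, and the per-node seeds exist only to make the run repeatable. Each node accumulates into a private counter and contributes a single partial sum, instead of issuing one remote write per sample.

```python
import random

def local_hits(seed, samples):
    """Count points inside the unit quarter-circle using only node-local state."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(num_nodes, samples_per_node):
    # Each simulated node accumulates privately: no remote write per sample.
    partials = [local_hits(seed=node, samples=samples_per_node)
                for node in range(num_nodes)]
    # One combining step stands in for a single message per node to the home.
    total_hits = sum(partials)
    estimate = 4.0 * total_hits / (num_nodes * samples_per_node)
    return estimate, len(partials)

pi_estimate, remote_writes = estimate_pi(num_nodes=4, samples_per_node=20000)
```

In this sketch the number of writes reaching the result object equals the number of nodes, not the number of samples.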
![Page 10: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/10.jpg)
General Read-Write
[Diagram: SV state held at each of Node 1 … Node N]
![Page 11: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/11.jpg)
11
Coordinate Multiple Protocols for Different Access Patterns
A static data management scheme may not yield performance improvements on varied data access patterns
![Page 12: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/12.jpg)
Composite Protocols
[Diagram: per-node SV state; one entry is labeled GR state]
1. One-sided PUT/GET to SV home
2. Migrate Referencing Task to SV home
3. Directory-based Protocol
![Page 13: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/13.jpg)
Directory Entries
〈Shared Variable (SV)〉 ↔ 0 | 0 0 0 ⋅⋅⋅ 0 (dirty bit | n bits: one per node)

Dirty bit   Per-node bits   Meaning
0           1 0 0 ⋅⋅⋅ 0     SV is only in its allocated node
1           0 0 1 ⋅⋅⋅ 0     Only one node can have a dirty SV
0           0 1 1 ⋅⋅⋅ 1     Multiple nodes may have clean SVs
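The directory entry layout on this slide (one dirty bit plus n per-node bits) can be modeled as a small data structure. The sketch below is an assumption-laden illustration: the class name `DirectoryEntry`, its methods, and the choice of four nodes are hypothetical, and only the three example states from the slide are encoded.

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    """One dirty bit plus n presence bits (one per node) for a shared variable."""
    num_nodes: int
    dirty: bool = False
    present: list = field(default_factory=list)

    def __post_init__(self):
        if not self.present:
            self.present = [False] * self.num_nodes

    def sharers(self):
        """Return the ids of the nodes whose presence bit is set."""
        return [node for node, bit in enumerate(self.present) if bit]

    def is_valid(self):
        # A dirty SV may be held by exactly one node; clean SVs may be shared.
        return len(self.sharers()) == 1 if self.dirty else True

    def bits(self):
        """Render as on the slide: dirty bit first, then the presence bits."""
        return " ".join(str(int(b)) for b in [self.dirty] + self.present)

home_only = DirectoryEntry(4, dirty=False, present=[True, False, False, False])
dirty_one = DirectoryEntry(4, dirty=True, present=[False, False, True, False])
shared = DirectoryEntry(4, dirty=False, present=[False, True, True, True])
invalid = DirectoryEntry(4, dirty=True, present=[True, True, False, False])
```

The `invalid` entry shows the state the protocol must never reach: a dirty SV recorded at more than one node.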
![Page 14: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/14.jpg)
14
[Diagram: a single node holding SV state]
![Page 15: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/15.jpg)
15
[Diagram: four nodes, each holding SV state, connected by a network]
![Page 16: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/16.jpg)
16
![Page 17: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/17.jpg)
17
[Diagram: four nodes, each holding SV state, connected by a network]
Home node for SV: the node where SV is allocated
Remote node for SV: a node whose memory does not store SV
![Page 18: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/18.jpg)
18
[Diagram: four nodes, each holding SV state, connected by a network; nodes i and j are marked]
Example: Home node is j
Read/Write activity at node i
![Page 19: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/19.jpg)
19
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Request (from node i to home node j)
![Page 20: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/20.jpg)
20
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 1: SV is in node j in clean state.
![Page 21: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/21.jpg)
21
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 1: SV is in node j in clean state.
Directory entry (dirty bit first; the bits for nodes i and j are marked): 0 0 1 0 ⋅⋅⋅ 0 0
![Page 22: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/22.jpg)
22
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 1: SV is in node j in clean state.
Data Copy
Directory entry (dirty bit first; the bits for nodes i and j are marked): 0 0 1 0 ⋅⋅⋅ 0 0
![Page 23: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/23.jpg)
23
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 1: SV is in node j in clean state.
Directory entry (dirty bit first; the bits for nodes i and j are marked): 0 0 1 1 ⋅⋅⋅ 0 0
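Case 1 of the read miss (request, data copy, directory update) can be traced in a few lines. This is a minimal sketch under assumptions, not the authors' runtime: the classes `Home` and `Remote`, the four-node setup, and the node ids chosen for j and i are hypothetical, and message passing is replaced by a direct method call.

```python
class Home:
    """Hypothetical home node: owns the SV value and its directory entry."""
    def __init__(self, home_id, num_nodes, value):
        self.home_id = home_id
        self.value = value
        self.dirty = False
        self.present = [False] * num_nodes
        self.present[home_id] = True   # initially the SV is only at its home

    def handle_read_miss(self, requester_id):
        # Case 1 only: the SV is clean, so the home's copy is up to date.
        assert not self.dirty, "the dirty case needs a write-back first"
        self.present[requester_id] = True   # record the new sharer
        return self.value                   # data copy sent to the requester

class Remote:
    """Hypothetical remote node with room for one local copy of the SV."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.cache = None    # no local copy yet

    def read(self, home):
        if self.cache is None:                        # read miss
            self.cache = home.handle_read_miss(self.node_id)
        return self.cache

home_j = Home(home_id=1, num_nodes=4, value=42)
node_i = Remote(node_id=2)
first = node_i.read(home_j)    # miss: request, data copy, directory update
second = node_i.read(home_j)   # hit: served from the local copy
```

After the first read, the directory records both node j and node i as holders of a clean copy, which matches the transition from one set presence bit to two on the slides.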
![Page 24: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/24.jpg)
24
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 2: SV is copied in node j and is in dirty state.
Directory entry (dirty bit first; the bits for nodes i and j are marked): 1 0 1 0 ⋅⋅⋅ 0 0
![Page 25: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/25.jpg)
25
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 2: SV is copied in node j and is in dirty state.
Write back
Directory entry (dirty bit first; the bits for nodes i and j are marked): 1 0 1 0 ⋅⋅⋅ 0 0
![Page 26: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/26.jpg)
26
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 2: SV is copied in node j and is in dirty state.
SV Copy
Directory entry (dirty bit first; the bits for nodes i and j are marked): 0 0 1 0 ⋅⋅⋅ 0 0
![Page 27: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/27.jpg)
27
[Diagram: Node j, Node i, and two other nodes, each holding SV state, connected by a network]
Read Miss at node i
Case 2: SV is copied in node j and is in dirty state.
Directory entry (dirty bit first; the bits for nodes i and j are marked): 0 0 1 1 ⋅⋅⋅ 0 0
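Case 2 adds a write-back before the copy. The sketch below follows the sequence shown on the slides (dirty entry, write back, SV copy, two clean holders) but is only an illustration: the class `DirectoryHome`, its `copies` dictionary standing in for per-node memory, and the `log` list are hypothetical, and the assumption that the write-back targets the home's memory is mine, not stated in the deck.

```python
class DirectoryHome:
    """Hypothetical directory for one SV, with simulated per-node copies."""
    def __init__(self, num_nodes, home_id, memory_value):
        self.memory = memory_value              # value held in the home's memory
        self.dirty = False
        self.present = [False] * num_nodes
        self.present[home_id] = True
        self.copies = {home_id: memory_value}   # simulated node-local copies
        self.log = []

    def local_write(self, node_id, value):
        """A write by the only holder makes its copy dirty."""
        sole_holder = [n == node_id for n in range(len(self.present))]
        assert self.present == sole_holder, "writer must be the only holder"
        self.copies[node_id] = value
        self.dirty = True

    def read_miss(self, requester_id):
        if self.dirty:
            owner = self.present.index(True)     # exactly one holder when dirty
            self.memory = self.copies[owner]     # step 1: write back
            self.dirty = False
            self.log.append("write back")
        self.copies[requester_id] = self.memory  # step 2: SV copy to requester
        self.present[requester_id] = True
        self.log.append("sv copy")
        return self.copies[requester_id]

home = DirectoryHome(num_nodes=4, home_id=1, memory_value=10)
home.local_write(1, 99)      # node j now holds a dirty copy
value = home.read_miss(2)    # read miss at node i
```

The requester sees the updated value only because the write-back happens before the copy; reversing the two steps would hand node i the stale value still in memory.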
![Page 28: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/28.jpg)
28
Performance Evaluation
![Page 29: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/29.jpg)
29
What do We Want in Benchmarks?
• Communication Patterns
• Test Composite Protocols
• Data Structures / Granularities
![Page 30: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/30.jpg)
30
Performance Comparison
• Best Hand Coded Versions
• X10’s Shared Memory Protocol (X10-Mem)
• Directory-based Protocol (GR-Mem)
• Combination (X10-Mem/GR-Mem)
![Page 31: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/31.jpg)
31
Code- and Data-Layout Restructurings
Patterns of Shared Variable Accesses
A Read-mostly: Replicate node-local copies --- reduce remote access
B Write-mostly: Intact: localize write access to the site of allocation
C Aggregate Data: Refactor into individual objects for element-wise access --- reduce false sharing
D Write-Following-Read from each place: Collecting Sum Reducer – reduce frequent remote writes
E Write-Once: Replicate node-local copies --- reduce remote access
![Page 32: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/32.jpg)
32
Code Restructurings in Hand-coded Versions
Benchmarks: Code Restructurings (A B C D E)
FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔ ✔ ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG) ✔ ✔ ✔
Delaunay Mesh Refinement (DMR) ✔ ✔ ✔
![Page 33: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/33.jpg)
33
Platform: Heldar (Opteron)
• CentOS Linux 6.0
• 1 Node = 2 HyperTransport connected CPUs
• QuadCore Opteron Processors
No. nodes: 16
Cores per node: 8
Memory per node: 8 GB
![Page 34: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/34.jpg)
[Bar chart: Speedup Over Sequential (y-axis, 0 to 90), using 128 workers, for FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, and DMR]
![Page 35: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/35.jpg)
[Bar chart: Speedup Over Sequential (y-axis, 0 to 90), using 128 workers, for FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, and DMR. Series: X10-Mem, Manual, GR-Mem]
![Page 36: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/36.jpg)
![Page 37: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/37.jpg)
![Page 38: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/38.jpg)
![Page 39: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/39.jpg)
[Bar chart: Speedup Over Sequential (y-axis, 0 to 90), using 128 workers, for FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, and DMR. Series: X10-Mem, Manual, GR-Mem/X10-Mem]
![Page 40: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/40.jpg)
Write-Once / Read-Mostly: Replication (Node 1 … Node N)
Result Object: Collecting Sum Reducer (Node 1 … Node N)
[Diagram: per-node GR state]
1. One-sided PUT/GET to GR home
2. Migrate Referencing Task to GR home
3. Directory-based Protocol
K-Means speedup with 128 workers: 68 and 79
Benchmarks: Code Restructurings (A B C D E)
FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG) ✔
Delaunay Mesh Refinement (DMR) ✔ ✔ ✔
Applicable to (A)PGAS Languages: Chapel, Fortress
![Page 41: A Coherence Protocol for Optimizing Global Shared Data Accesses](https://reader035.vdocuments.us/reader035/viewer/2022062813/5681652c550346895dd7afc7/html5/thumbnails/41.jpg)
41
Questions?