A Study of Different Instantiations of the OpenMP Memory
Model and Their Software Cache Implementations
Chen Chen, Joseph B Manzano, Ge Gan, Guang R. Gao, Vivek Sarkar
April 21st, 2009
Outline
• The OpenMP memory model is not well-defined
• Our solution: Four well-defined instantiations of the OpenMP memory model
• Implementations – Cache Protocols
• Experimental Results
Situation of Shared Memory Parallel Programming: 3-Tier Hierarchy of Programming Models
• Joe Parallel Programmers
– Have some basic knowledge of parallel programming languages
• Parallel Programming Specialists
– Have some basic knowledge of memory consistency and cache organizations
• Computer System Specialists
– Experts on parallel system architecture
The OpenMP Memory Model
• Temporary view
– Register, cache, local storage
– Cache variables
– Not required
• Flush operation
– Enforce consistency
– Flush-set
– Reordering restriction
– Serialized requirement
• Data-race program
– Unspecified behavior
[Figure: threads connected to the main memory through an interconnection network; two of the three threads have a temporary view of the memory.]
Is the OpenMP Memory Model well-defined?
The OpenMP Memory Model is not Well-defined
• Complex semantics of the temporary view
– Some threads own temporary views, others do not
– Why access memory if the temporary view has a copy?
• Unspecified semantics of data-race programs
– Why is reordering (between flush and memory accesses) still restricted?
– Applications are limited to being data-race-free.
• Unclear definition of the flush operation
– Variables may “escape” the temporary view before a flush
– The serialized requirement is unnecessary
Our solution
• Simple semantics for the temporary view.
– We defined ModelIDEAL with very simple semantics.
• Specified behaviors of all programs.
– We defined four instantiations of the OpenMP memory model. Each one has specified semantics for any program. (They are equivalent for data-race-free (DRF) programs.)
• Clear definition of the flush operation.
– We defined simple semantics of the flush operation.
– We introduced the non-deterministic flush operation to solve the space limitation problem.
ModelIDEAL
[Figure: three threads, each with its own temporary view of the memory, connected to the main memory through an interconnection network.]
Each thread owns a temporary view.
Infinitely big space.
Write: Access of the temporary view.
Read: 1) Access of the temporary view (Hit); or 2) Access of the main memory (Miss).
Flush: Writing back “dirty values” and discarding all the values. (one thread)
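The read, write and flush rules listed above can be sketched in C as a small single-process simulation. This is only an illustration, not the authors' implementation: the names `temp_view`, `tv_write`, `tv_read`, `tv_flush` and the constant `NLOC` are hypothetical, the "infinitely big space" is modelled by giving every location its own slot, and keeping a fetched value as a clean copy after a read miss is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define NLOC 4                      /* hypothetical number of shared locations */

static int main_memory[NLOC];       /* the shared main memory */

/* One thread's temporary view: a private slot for every location,
   so the view never runs out of space. */
typedef struct {
    int  value[NLOC];
    bool valid[NLOC];               /* the view holds a copy of this location */
    bool dirty[NLOC];               /* the copy was written and not yet flushed */
} temp_view;

/* Write: only the temporary view is accessed. */
static void tv_write(temp_view *tv, int loc, int v) {
    assert(loc >= 0 && loc < NLOC);
    tv->value[loc] = v;
    tv->valid[loc] = true;
    tv->dirty[loc] = true;
}

/* Read: a hit is served from the temporary view, a miss goes to main
   memory. Caching the fetched value as clean is an assumption. */
static int tv_read(temp_view *tv, int loc) {
    assert(loc >= 0 && loc < NLOC);
    if (!tv->valid[loc]) {
        tv->value[loc] = main_memory[loc];
        tv->valid[loc] = true;
        tv->dirty[loc] = false;
    }
    return tv->value[loc];
}

/* Flush (one thread): write back the dirty values, then discard all values. */
static void tv_flush(temp_view *tv) {
    for (int loc = 0; loc < NLOC; loc++) {
        if (tv->valid[loc] && tv->dirty[loc])
            main_memory[loc] = tv->value[loc];
        tv->valid[loc] = false;
        tv->dirty[loc] = false;
    }
}
```

In this sketch a thread keeps returning its cached copy of a location until it flushes its own view, even after another thread has written a newer value back to main memory.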
ModelGF
[Figure: three threads, each with its own temporary view of the memory, connected to the main memory through an interconnection network.]
Limited space
Non-deterministic flush: A flush operation can be performed at any time. (To solve the limited space problem)
Global flush: Flush operation on all of the threads.
ModelLF
[Figure: three threads, each with its own temporary view of the memory, connected to the main memory through an interconnection network.]
Limited space
Non-deterministic flush.
Local Flush: Flush operation on one thread.
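The difference between the local flush of ModelLF and the global flush of ModelGF can be illustrated with a minimal C sketch. All names here (`view`, `views`, `local_flush`, `global_flush`, `NTHREADS`, `NLOC`) are hypothetical, the order in which a global flush visits the threads is an assumption, and the non-deterministic flush is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define NTHREADS 3                  /* hypothetical thread count */
#define NLOC     2                  /* hypothetical number of shared locations */

static int mem[NLOC];               /* shared main memory */

typedef struct {
    int  value[NLOC];
    bool valid[NLOC];
    bool dirty[NLOC];
} view;

static view views[NTHREADS];        /* one temporary view per thread */

static void view_write(int tid, int loc, int v) {
    views[tid].value[loc] = v;
    views[tid].valid[loc] = true;
    views[tid].dirty[loc] = true;
}

static int view_read(int tid, int loc) {
    view *tv = &views[tid];
    if (!tv->valid[loc]) {          /* miss: fetch from main memory */
        tv->value[loc] = mem[loc];
        tv->valid[loc] = true;
        tv->dirty[loc] = false;
    }
    return tv->value[loc];
}

/* Local flush (ModelLF): only the calling thread's view is written
   back and emptied. */
static void local_flush(int tid) {
    view *tv = &views[tid];
    for (int loc = 0; loc < NLOC; loc++) {
        if (tv->valid[loc] && tv->dirty[loc])
            mem[loc] = tv->value[loc];
        tv->valid[loc] = false;
        tv->dirty[loc] = false;
    }
}

/* Global flush (ModelGF): the flush is applied to every thread's view.
   Visiting the threads in id order is an assumption of this sketch. */
static void global_flush(void) {
    for (int tid = 0; tid < NTHREADS; tid++)
        local_flush(tid);
}
```

After `local_flush(0)`, another thread that already cached a location still reads its old copy; after `global_flush()`, every thread misses and fetches the current value from main memory.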
ModelRLF
[Figure: three threads, each with its own temporary view of the memory, connected to the main memory through an interconnection network.]
Limited space
Non-deterministic flush.
Acquire: Discarding the “clean value”.
Release: Writing back “dirty value”.
Barrier: Acquire + Release.
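The acquire, release and barrier operations listed above might be sketched in C as follows. The names (`rlf_view`, `rlf_acquire`, `rlf_release`, `rlf_barrier`) are hypothetical. Two points are assumptions rather than statements from the slides: a released value stays in the view as a clean copy, and the barrier performs the release before the acquire.

```c
#include <assert.h>
#include <stdbool.h>

#define RLF_NLOC 2                  /* hypothetical number of shared locations */

static int rlf_mem[RLF_NLOC];       /* shared main memory */

typedef struct {
    int  value[RLF_NLOC];
    bool valid[RLF_NLOC];
    bool dirty[RLF_NLOC];
} rlf_view;

static void rlf_write(rlf_view *tv, int loc, int v) {
    tv->value[loc] = v;
    tv->valid[loc] = true;
    tv->dirty[loc] = true;
}

static int rlf_read(rlf_view *tv, int loc) {
    if (!tv->valid[loc]) {          /* miss: fetch from main memory */
        tv->value[loc] = rlf_mem[loc];
        tv->valid[loc] = true;
        tv->dirty[loc] = false;
    }
    return tv->value[loc];
}

/* Acquire: discard the clean values; dirty values stay in the view. */
static void rlf_acquire(rlf_view *tv) {
    for (int loc = 0; loc < RLF_NLOC; loc++)
        if (tv->valid[loc] && !tv->dirty[loc])
            tv->valid[loc] = false;
}

/* Release: write back the dirty values. Keeping them afterwards as
   clean copies is an assumption of this sketch. */
static void rlf_release(rlf_view *tv) {
    for (int loc = 0; loc < RLF_NLOC; loc++)
        if (tv->valid[loc] && tv->dirty[loc]) {
            rlf_mem[loc] = tv->value[loc];
            tv->dirty[loc] = false;
        }
}

/* Barrier: acquire + release. Doing the release first is an assumption;
   with that order the view ends up empty, like after a full flush. */
static void rlf_barrier(rlf_view *tv) {
    rlf_release(tv);
    rlf_acquire(tv);
}
```

The point of the split is visible in `rlf_acquire`: a thread entering a critical section keeps its own dirty values and can keep reading them from its temporary view.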
Implementations – Cache Protocols
• Each thread contains a cache which corresponds to its temporary view.
• Each operation is performed on one cache line.
• Per-location dirty bits in each cache line.
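One plausible reason for per-location dirty bits is that two threads can cache the same line, write different words of it, and write the line back without overwriting each other's updates. The C sketch below illustrates that idea only; the layout (`cache_line`, `WORDS_PER_LINE`, bit masks for the valid and dirty bits) is hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_LINE 8            /* hypothetical line size, in words */

typedef struct {
    uint32_t data[WORDS_PER_LINE];
    unsigned valid;                 /* bit i set: word i holds a usable copy */
    unsigned dirty;                 /* bit i set: word i was written locally */
} cache_line;

/* A write touches one word and sets that word's dirty bit. */
static void line_write(cache_line *line, int word, uint32_t v) {
    assert(word >= 0 && word < WORDS_PER_LINE);
    line->data[word] = v;
    line->valid |= 1u << word;
    line->dirty |= 1u << word;
}

/* Write back only the dirty words to the line's home in main memory,
   then clear the dirty bits. Words that were never written locally are
   left untouched in memory. */
static void line_write_back(cache_line *line, uint32_t *home) {
    for (int word = 0; word < WORDS_PER_LINE; word++)
        if (line->dirty & (1u << word))
            home[word] = line->data[word];
    line->dirty = 0;
}
```

With a single dirty bit per line, the second write-back would copy the whole line and could replace the first thread's word with a stale value.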
Centralized Directory for ModelGF
• Directory in the main memory
• Information of all the caches
• A flush will look up the directory and inform the involved threads
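A centralized directory of this kind could be sketched in C as one sharer bit mask per cache line. The encoding and the names (`directory`, `dir_record`, `dir_flush`, `DIR_LINES`, `DIR_THREADS`) are assumptions made for illustration; the slides only state that the directory lives in main memory, holds information about all caches, and is consulted by a flush to find the threads to inform.

```c
#include <assert.h>
#include <stdint.h>

#define DIR_LINES   16              /* hypothetical number of tracked lines */
#define DIR_THREADS 6               /* hypothetical number of threads */

/* One entry per line; bit t set means thread t caches the line. */
static uint32_t directory[DIR_LINES];

/* Record that thread tid now holds a copy of the line. */
static void dir_record(int line, int tid) {
    assert(line >= 0 && line < DIR_LINES);
    assert(tid >= 0 && tid < DIR_THREADS);
    directory[line] |= 1u << tid;
}

/* Flush issued by thread tid for one line: look up the directory,
   return the mask of the other threads that must be informed, and
   clear the entry because no view keeps the line afterwards. */
static uint32_t dir_flush(int line, int tid) {
    assert(line >= 0 && line < DIR_LINES);
    assert(tid >= 0 && tid < DIR_THREADS);
    uint32_t involved = directory[line] & ~(1u << tid);
    directory[line] = 0;
    return involved;
}
```

The returned mask names exactly the threads that hold the line, so a flush does not need to interrupt threads that never cached it.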
Cell Architecture
• Very small local storage per SPE
• Local storage stores both data and instructions
• SPE accesses global memory by DMA transfers
[Figure: Cell architecture. A Power processing element (PPE), the global memory and eight synergistic processing elements (SPEs), each with 256K of local storage, connected by the Element Interconnect Bus (EIB).]
OPELL Framework
• An open source toolchain / runtime effort to implement OpenMP for the CBE
• Single Source Compiler
– Generating sequential code for the PPE
– Generating parallel code for the SPEs
• Runtime System (PPE) and Runtime System (SPE), with a Partition Manager (PM) and a Software Cache (SWC)
– Managing software caches on the SPEs
– Loading/unloading the SPEs’ code
– Executing the sequential code
– Triggering the PM to execute tasks
– Task assignments
– Remote function calls
• The SWC is in the local storage
Experimental Testbeds
• Hardware (PlayStation 3™)
– 3.2 GHz Cell Broadband Engine CPU (with 6 accessible SPEs)
– 256MB global shared memory.
• Software Framework (OPELL)
– An open source toolchain / runtime effort to implement OpenMP for the CBE.
• Benchmarks
– RandomAccess and Stream from the HPC Challenge benchmark suite.
– Integer Sort (IS), Embarrassingly Parallel (EP) and Multigrid (MG) from the NAS Parallel Benchmarks.
Summary of Main Results
• Performance and Scalability
– ModelLF consistently outperforms ModelGF
– Good scalability of ModelLF
• Impact of Cache Line Eviction
– Cache line eviction has a significant impact on the performance difference between ModelLF and ModelGF
– This difference becomes larger as the cache size (per core/thread) becomes smaller
• Programmability
– In our experiments, few changes to the OpenMP code are needed
– In other words, the performance advantage is achieved without compromising programmability
ModelGF vs. ModelLF on Execution Time (Cache size = 32K)
Performance improvement:
• EP-A: 1.53×
• IS-W: 1.32×
• MG-W: 1.19×
• RandomAccess: 1.05×
• Stream: 1.36×
[Figure: bar chart of normalized execution time (0 to 1) for EP-A, IS-W, MG-W, RandomAccess and Stream under ModelLF and ModelGF.]
ModelLF consistently outperforms ModelGF
Speedup as a Function of the Number of SPEs under ModelLF
• IS-W and EP-W achieve almost linear speedup.
• MG-W performs worse because of unbalanced workloads (3, 5 or 6 SPEs).
[Figure: speedup (0 to 6) versus number of threads (SPEs), 1 to 6, for MG-W, IS-W and EP-W.]
ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for IS-W
• The difference in normalized execution time increased from 0.15 to 0.25 as the cache size per SPE was decreased from 64KB to 4KB.
• The two curves of cache eviction ratio overlap because of completely identical cache settings.
[Figure, two panels: normalized execution time (0 to 1) and cache eviction ratio (0% to 10%) versus cache size (4k, 8k, 16k, 32k, 64k) for ModelGF and ModelLF.]
ModelGF vs. ModelLF on Execution Time and Cache Eviction Ratio for MG-W
• The difference in normalized execution time increased from 0.04 to 0.16 as the cache size per SPE was decreased from 32KB to 4KB.
• The two curves of cache eviction ratio overlap because of completely identical cache settings.
[Figure, two panels: normalized execution time (0 to 1) and cache eviction ratio (0% to 1%) versus cache size (4k, 8k, 16k, 32k) for ModelGF and ModelLF.]
Conclusion and Future Work
• Our contributions
– Formalization of the OpenMP memory model
– Performance studies of ModelGF and ModelLF
• Future work
– Studies of ModelRLF
– Tests on more benchmarks
– Evaluations on more many-core architectures
Acknowledgement
• This work was supported by NSF (CNS-0509332, CSR-0720531, CCF-0833166, CCF-0702244), and other government sponsors.
• Joseph B Manzano, Ge Gan, Guang R. Gao and Vivek Sarkar are co-authors of the paper
• Joseph B Manzano and Guang R. Gao gave a lot of useful comments on the slides
BACKUP
Example (1)
    X = 0; p = &X; q = &X;
    #pragma omp parallel sections
    {
        #pragma omp section
        {   // Section 1, assume it is running on thread T1.
    1:      *p = 1;
    2:      #pragma omp flush (X)
        }
        #pragma omp section
        {   // Section 2, assume it is running on thread T2.
    3:      *q = 2;
    4:      #pragma omp flush (X)
        }
        #pragma omp section
        {   // Section 3, assume it is running on thread T3.
    5:      #pragma omp flush (X)
            // Assume that compiler cannot establish that p == q
    6:      v1 = *p;
    7:      v2 = *q;
    8:      v3 = *p;
        }
    }
ModelIDEAL: v1, v2 and v3 always read the same value. (E.g. {v1==v2==v3==0(or 1, 2)})
ModelGF: v1, v2 and v3 may read different values. (E.g. {v1==1, v2==v3==2} if there is a non-deterministic flush between statements 6 and 7.)
ModelLF: v1, v2 and v3 may read different values.
The order 1-3-2-4-5-6-7-8 results in {v1==v2==v3==2} under ModelIDEAL; {v1==v2==v3==1} under ModelGF; and {v1==v2==v3==2} under ModelLF.
Example (2)
    X = 0; p = &X; q = &X;
    #pragma omp parallel sections
    {
        #pragma omp section
        {   // Section 1, assume it is running on thread T1.
    1:      *p = 1;
    2:      #pragma omp critical  // A flush (acquire) here.
    3:      v1 = *p;
    4:      // A flush (release) here.
        }
        #pragma omp section
        {   // Section 2, assume it is running on thread T2.
    5:      *q = 2;
    6:      #pragma omp critical  // A flush (acquire) here.
    7:      v1 = *q;
    8:      // A flush (release) here.
        }
    }
ModelIDEAL, ModelGF, and ModelLF: Statements 2 and 6 will remove the values in the temporary views, so statements 3 and 7 have to access the main memory.
ModelRLF: Statements 2 and 6 are acquire operations. The values are preserved in the temporary views – Statements 3 and 7 can access the temporary views to get the values.
States of Cache Lines
• Invalid: All the words of the cache line are invalid.
• Clean: All the words of the cache line contain “clean values”.
• Dirty: All the words of the cache line contain “dirty values”.
• Clean-Dirty: Clean + Dirty
• Invalid-Dirty: Invalid + Dirty
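One way to read these five states is as a summary of per-word valid and dirty bits. The C sketch below encodes that reading; it is an interpretation, not the authors' definition. The names (`line_state`, `classify`, `WORDS`) are hypothetical, and the sketch assumes that a dirty word is always valid and that a line never mixes invalid words with clean words.

```c
#include <assert.h>

#define WORDS 4                     /* hypothetical number of words per line */

typedef enum {
    LINE_INVALID,                   /* all words invalid */
    LINE_CLEAN,                     /* all words clean */
    LINE_DIRTY,                     /* all words dirty */
    LINE_CLEAN_DIRTY,               /* some words clean, the rest dirty */
    LINE_INVALID_DIRTY              /* some words invalid, the rest dirty */
} line_state;

/* Derive the line state from per-word bit masks: bit i of valid says
   word i holds a copy, bit i of dirty says word i was written locally. */
static line_state classify(unsigned valid, unsigned dirty) {
    const unsigned all = (1u << WORDS) - 1u;
    assert((dirty & ~valid) == 0);  /* assumed: dirty words are valid */
    if (valid == 0)
        return LINE_INVALID;
    if (dirty == 0) {
        assert(valid == all);       /* assumed: no invalid/clean mix */
        return LINE_CLEAN;
    }
    if (dirty == all)
        return LINE_DIRTY;
    if (valid == all)
        return LINE_CLEAN_DIRTY;
    assert(valid == dirty);         /* assumed: no invalid/clean mix */
    return LINE_INVALID_DIRTY;
}
```

Under this reading, a write to one word of an invalid line gives Invalid-Dirty, and a later read that fetches the rest of the line gives Clean-Dirty.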
State transition diagram for the cache protocol of ModelGF and ModelLF
[Figure: state transition diagram over the states Invalid, Clean, Dirty, Clean-Dirty and Invalid-Dirty, with transitions labelled read, write, read/write and flush.]
State transition diagram for the cache protocol of ModelRLF
[Figure: state transition diagram over the states Invalid, Clean, Dirty, Clean-Dirty and Invalid-Dirty, with transitions labelled by combinations of read, write, acquire, release, barrier and flush.]
Overall Experimental Results: ModelGF vs. ModelLF
Performance (execution time) improvement by cache size ("-" marks a cache size with no listed result):
Benchmark      4K      8K      16K     32K     64K
EP-A           -       -       -       1.53×   -
IS-W           1.33×   1.32×   1.32×   1.32×   1.32×
MG-W           1.20×   1.23×   1.21×   1.19×   -
RandomAccess   -       -       -       1.05×   -
Stream         -       -       -       1.36×   -
ModelLF consistently outperforms ModelGF