Automating and Optimizing Data Transfers for Many-core Coprocessors
Student: Bin Ren; Advisor: Gagan Agrawal; NEC Intern Mentors: Nishkam Ravi, Yi Yang; Other Collaborators: Min Feng, Srimat Chakradhar
CSE Poster Event 2014
Motivation
The Goal of This Work
Many-core coprocessors commonly have their own memory hierarchy:
– Intel Xeon Phi
– NVIDIA GPUs
Programming Challenges
Experimental Results
CPU: Intel Xeon E5-2609 (8-Core)
Coprocessor: Intel Xeon Phi (61-Core) -- MIC
Compiler: ICC
Contributions
Static Mechanism and Runtime Mechanism
Programming with LEO/OpenACC
Design dynamic (runtime library) and static (code transformation) methods that automatically manage and optimize data communication between the CPU and many-core coprocessors for multi-dimensional arrays and multi-level pointers
– Minimize redundant data transfers
– Utilize Direct Memory Access (DMA)
– Reduce memory allocation on the coprocessor
– Preserve compiler optimizations on the coprocessor
State of the Art
Comparison of Best CPU+MIC and CPU-only
Speedup of Best CPU+MIC over 8-Core CPU
– Study the performance bottlenecks of state-of-the-art dynamic and static methods
– Design two novel heap linearization algorithms and an optimized MYO method to improve communication performance
– Implement a static source-to-source code transformer based on the Partial Linearization with Pointer Reset design
– Evaluate and analyze both dynamic and static approaches on multiple benchmarks to show the efficacy of our Partial Linearization with Pointer Reset method
Data Transfer
[Figure: CPU host (8-core) connected over PCIe to a many-core coprocessor (60+ cores, e.g., Intel MIC or NVIDIA GPU)]
…
// Change malloc site to split pointers and real data
#pragma offload target(mic) in(A_data, B_data, C_data : length(m*n) REUSE)
{}
#pragma offload target(mic) nocopy(A, B, C : length(n) ALLOC)
{
    // Connect A, B, C with A_data, B_data, C_data
}
#pragma offload target(mic) nocopy(A, B, C : length(n))
{
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            A[i][j] = B[i][j] * C[i][j];
}
#pragma offload target(mic) out(A_data : length(m*n) FREE)
…
Current Approaches to Managing Data Transfer between CPU and Coprocessor (Productivity vs. Performance Trade-off)
Virtual Shared Memory (MYO): Pros: easy programming, supports complex structures. Cons: slow (unnecessary synchronization).
Explicit Message Passing: Pros: fast. Cons: users must manage data offloading; only bit-wise copyable data is supported.
int *a; int **b
Our Static Mechanism
Our Combined Mechanism
Summary of Benchmarks
Comparison of Static Methods (Linearization) and OPT-Runtime (MYO)
Speedup of Static over OPT-Runtime
Data Transfer Size of Static over OPT-Runtime
Comparison of OPT-Runtime and Runtime (MYO)
Speedup of OPT-Runtime over Runtime
Data Transfer Size of OPT-Runtime over Runtime
Comparison of OPT-Complete Linearization and Complete Linearization
Speedup of OPT-CL over CL for MG
Data Transfer Size of OPT-CL over CL for MG
Partial Linearization with Pointer Reset (PR)
High-Dimensional Array Addition
Struct and Non-unit Stride Access
– No modification to the access site: preserves potential compiler optimizations; reduces the possibility of introducing bugs
– Reduced communication overhead: only linearized data is transferred; minimizes the number of offloads
– DMA utilization: linearized data resides in a dense memory buffer
– Explicit Message Passing
– Virtual Shared Memory (MYO)