PARRAY: The Array-Based GPU Programming Technology
Yifeng Chen
School of EECS, Peking University, China
Two Conflicting Approaches for Programmability in HPC
Top-down Approach
• Core programming model is high-level (e.g. a functional parallel language).
• Must rely on heavy heuristic runtime optimization.
• Low-level program constructs are added to improve low-level control.
• Risks: programmers tend to avoid using “extra” constructs; low-level controls do not fit well into the core model.

Bottom-up Approach (PARRAY)
• Core programming model exposes the memory hierarchy.
• Same algorithm, same performance, same intellectual challenge, but shorter code.
• Runtime optimization is possible, but not part of the core model.
Basic Notation
• Dimensions in a tree
• A dimension may refer to another array type.
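As a rough illustration of the "dimensions in a tree" idea, the C sketch below models an array type as a tree of dimensions and computes a row-major offset from the leaf indices. The names `dim_t`, `dim_extent` and `dim_offset` are hypothetical and are not PARRAY's API; the row-major layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a dimension tree (not PARRAY's own data structure).
 * A leaf has nsub == 0 and an extent in `size`; an inner node lists its
 * sub-dimensions in `sub`, which may point at a tree defined elsewhere. */
typedef struct dim {
    size_t size;
    const struct dim *sub;
    size_t nsub;
} dim_t;

/* Total number of elements covered by a dimension (product of its leaves). */
size_t dim_extent(const dim_t *d) {
    if (d->nsub == 0) return d->size;
    size_t n = 1;
    for (size_t i = 0; i < d->nsub; ++i) n *= dim_extent(&d->sub[i]);
    return n;
}

/* Consumes one index per leaf, left to right, and folds them row-major. */
static size_t offset_rec(const dim_t *d, const size_t **cursor) {
    if (d->nsub == 0) {
        size_t i = *(*cursor)++;
        assert(i < d->size); /* index must lie inside the leaf's extent */
        return i;
    }
    size_t off = 0;
    for (size_t i = 0; i < d->nsub; ++i)
        off = off * dim_extent(&d->sub[i]) + offset_rec(&d->sub[i], cursor);
    return off;
}

/* Row-major offset of the element addressed by `leaf_indices`. */
size_t dim_offset(const dim_t *root, const size_t *leaf_indices) {
    const size_t *cursor = leaf_indices;
    return offset_rec(root, &cursor);
}
```

For a type shaped like `[2][[2048][4096]]`, the root has two children: a leaf of extent 2 and an inner node with leaves 2048 and 4096. Another type can reuse that inner node simply by pointing its `sub` at it, which is one way to read "a dimension may refer to another array type".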
Motivating Examples for PARRAY
Thread Arrays
Generating CUDA+Pthread

    #parray {pthd [2]} P
    #parray {paged float [2][[2048][4096]]} H
    #parray {dmem float # H_1} D
    #parray {[#P][#D]} G
    float* host;
    _pa_pthd* p;
    #mainhost{
        #create P(p)
        #create H(host)
        #detour P(p) {
            float* dev;
            INIT_GPU($tid$);
            #create D(dev)
            #insert DataTransfer(dev, G, host, H){}
        }
        #destroy H(host)
        #destroy P(p)
    }

Slide annotations: pthread_create, sem_post, sem_wait, pthread_join
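The slide names pthread_create, sem_post, sem_wait and pthread_join as the calls behind the thread array. The C sketch below shows one plausible arrangement of those four calls; it is not the code PARRAY actually generates. The names `worker_t`, `worker_main` and `scatter_with_threads` are hypothetical, and a plain `memcpy` into an ordinary buffer stands in for the host-to-GPU transfer.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define NUM_WORKERS 2 /* mirrors "pthd [2]" on the slide */

/* Hypothetical per-thread state; none of these names come from PARRAY. */
typedef struct {
    int tid;            /* worker index, standing in for $tid$ */
    const float *host;  /* shared source array (the role of H) */
    float *dev;         /* plain buffer standing in for device memory (D) */
    size_t slice_elems; /* number of elements owned by each worker */
    sem_t start;        /* main -> worker: begin the transfer */
    sem_t done;         /* worker -> main: transfer finished */
} worker_t;

static void *worker_main(void *arg) {
    worker_t *w = arg;
    sem_wait(&w->start); /* block until the main thread releases this worker */
    /* A real GPU version would call cudaMemcpy here instead of memcpy. */
    memcpy(w->dev, w->host + (size_t)w->tid * w->slice_elems,
           w->slice_elems * sizeof(float));
    sem_post(&w->done);
    return NULL;
}

/* Copies slice t of `host` into dev[t], using one thread per slice. */
void scatter_with_threads(const float *host, float *dev[NUM_WORKERS],
                          size_t slice_elems) {
    pthread_t threads[NUM_WORKERS];
    worker_t workers[NUM_WORKERS];
    assert(host != NULL && dev != NULL);

    for (int t = 0; t < NUM_WORKERS; ++t) {
        workers[t].tid = t;
        workers[t].host = host;
        workers[t].dev = dev[t];
        workers[t].slice_elems = slice_elems;
        if (sem_init(&workers[t].start, 0, 0) != 0 ||
            sem_init(&workers[t].done, 0, 0) != 0 ||
            pthread_create(&threads[t], NULL, worker_main, &workers[t]) != 0)
            abort(); /* sketch only: no graceful recovery */
    }
    for (int t = 0; t < NUM_WORKERS; ++t)
        sem_post(&workers[t].start); /* release every worker */
    for (int t = 0; t < NUM_WORKERS; ++t)
        sem_wait(&workers[t].done);  /* wait until each transfer completes */
    for (int t = 0; t < NUM_WORKERS; ++t) {
        pthread_join(threads[t], NULL);
        sem_destroy(&workers[t].start);
        sem_destroy(&workers[t].done);
    }
}
```

On toolchains where the thread library is separate, this needs to be built with `-pthread`.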
Generating MPI or IB/verbs

    #parray { mpi [2] } M
    #parray { paged float [2][[2048][4096]] } H
    #parray { [#M][#H_1] } G
    float* host;
    _pa_mpi* m;
    #mainhosts{
        #create M(m)
        #create H(host)
        #detour M(m) {
            float* dev;
            #create H_1(dev)
            #insert DataTransfer(dev, G, host, H){}
        }
        #destroy H(host)
        #destroy M(m)
    }

Slide annotation: MPI_Scatter
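The slide associates this data transfer with MPI_Scatter. As a reminder of what a scatter moves, the C sketch below emulates its data movement serially, without MPI: chunk r of the root buffer ends up in rank r's receive buffer. The function name `scatter_emulated` is hypothetical, and the sketch says nothing about how PARRAY chooses or emits the call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Serial emulation of scatter semantics (no MPI involved).
 * In a real MPI program the equivalent movement would be requested with
 *   MPI_Scatter(sendbuf, count, MPI_FLOAT, recvbuf, count, MPI_FLOAT,
 *               root, comm);
 * where every rank passes its own recvbuf. Here recv[r] plays the role
 * of rank r's receive buffer. */
void scatter_emulated(const float *send, size_t chunk_elems,
                      float *const *recv, int nranks) {
    assert(send != NULL && recv != NULL && nranks > 0);
    for (int r = 0; r < nranks; ++r) {
        /* Rank r receives the r-th contiguous chunk of the root buffer. */
        memcpy(recv[r], send + (size_t)r * chunk_elems,
               chunk_elems * sizeof(float));
    }
}
```

With two ranks and a source shaped like `[2][[2048][4096]]`, `chunk_elems` would be 2048 * 4096, so each rank receives one `H_1`-sized block.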
Other Communication Patterns
• ALLTOALL
• BCAST
One-Line CUDA Code
Large-Scale FFT in 20 lines
• Deeply optimized algorithm (ICS 2010)
• Zero-copy for hmem
(Before Nov 2011)
Direct Simulation of Turbulent Flows
Scale
• Up to 14336 3D, single precision
• 12 distributed arrays, each with 11 TB of data (128 TB total)
• Entire Tianhe-1A with 7168 nodes

Progress
• 4096 3D completed
• 8192 3D half-way; 14336 3D tested for performance

Software Technologies
• PARRAY (ACM PPoPP’12): code is only 300 lines.
• Programming-level resilience technology for stable computation

Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
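The data volumes quoted under Scale can be cross-checked with a short calculation. It rests on assumptions the slide does not state: that "14336 3D" denotes a 14336 × 14336 × 14336 grid, that each array holds one 4-byte single-precision value per grid point, that the quoted TB figures are binary units (TiB), and that the data is spread evenly over the nodes.

```latex
\begin{align*}
N &= 14336^3 = (14 \cdot 2^{10})^3 = 2744 \cdot 2^{30}\ \text{grid points} \\
\text{bytes per array} &= 4N = 10976 \cdot 2^{30}\ \text{B} = 10976\ \text{GiB} \approx 10.72\ \text{TiB} \\
\text{12 arrays} &= 12 \cdot 10976\ \text{GiB} = 131712\ \text{GiB} \approx 128.6\ \text{TiB} \\
\text{per node} &= 131712\ \text{GiB} / 7168 = 18.375\ \text{GiB}
\end{align*}
```

Under these assumptions the result agrees with the slide's rounded figures of 11 TB per array and 128 TB in total.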
Discussions
Can other programming models benefit from PARRAY ideas?
• MPI (more expressive datatype)
• OpenACC (optimization for coalescing accesses)
• PGAS (generating PGAS library calls)
• IB/verbs (directly generating Zero-Copy IB calls)

PARRAY helps, if you can write it down!
• Any index expressions using add/mul/mod/div
• Irregular structures must be encoded into arrays and then benefit from PARRAY.
• Generating Pthread + CUDA + MPI (future support of FPGA and MIC possible) + macros
• Macros are compiled out: no performance loss.
• Typical training = 3 days; friendly to engineers, geophysicists…
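The point about encoding irregular structures into arrays can be illustrated with a common, PARRAY-independent technique: storing ragged rows in two flat arrays, in the style of compressed sparse row storage. The C sketch below is only an illustration of that encoding; `ragged_t`, `ragged_row_len` and `ragged_get` are hypothetical names, and the slides do not say which encoding the author has in mind.

```c
#include <assert.h>
#include <stddef.h>

/* Ragged rows encoded as two regular arrays:
 *   values[]    all elements, row after row, back to back
 *   row_start[] nrows + 1 entries; row r occupies the half-open range
 *               [row_start[r], row_start[r + 1]) of values[] */
typedef struct {
    const float *values;
    const size_t *row_start;
    size_t nrows;
} ragged_t;

/* Number of elements in row r. */
size_t ragged_row_len(const ragged_t *a, size_t r) {
    assert(r < a->nrows);
    return a->row_start[r + 1] - a->row_start[r];
}

/* Element k of row r, located with one add on top of the row's start. */
float ragged_get(const ragged_t *a, size_t r, size_t k) {
    assert(k < ragged_row_len(a, r));
    return a->values[a->row_start[r] + k];
}
```

Once the structure is flattened this way, every access is an ordinary array index expression built from additions, which is the form the slide says PARRAY can work with.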