accelerating forward modeling by cpu,gpu and mic · accelerating forward modeling by cpu,gpu and...
TRANSCRIPT
![Page 1: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/1.jpg)
Accelerating Forward Modeling
by CPU,GPU and MIC
You, Yang and Fu, Haohuan
Department of Computer Science & Technology
Tsinghua University, Beijing, China
![Page 2: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/2.jpg)
Outline
Stencil
Architectures
Optimization Techniques
Experimental results
![Page 3: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/3.jpg)
Introduction
• Forward Modeling
– Oil and Gas Exploration
• Iterative Stencil Loops
– 99% of the time
• Lax-Wendroff Correction (LWC)
– Dablain, 1986
![Page 4: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/4.jpg)
LWC stencil
• Memory Access
– (36+1+1)*3=114
• Operations
– 228
• Flop to Byte Ratio
– 0.25 ~ 9.5 (double)
– 0.5 ~ 19 (single)
![Page 5: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/5.jpg)
Outline
Stencil
Architectures
Optimization Techniques
Experimental results
![Page 6: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/6.jpg)
Architectures
Architectures Peak Performance
(Double Precision)
Flop to Byte Ratio
(Double Precision)
Intel Sandy Bridge
CPU
256 Gflops
3.76
Intel Knights Corner
MIC
1010 Gflops 6.35
Nvidia Fermi C2070
GPU
515 Gflops 5.31
Nvidia Kepler K20x
GPU
1320 Gflops 7.02
![Page 7: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/7.jpg)
Outline
Stencil
Architectures
Optimization Techniques
Experimental results
![Page 8: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/8.jpg)
General techniques
• On-chip data reuse
– algorithmic FBR < architectural FBR
– 0.50 (stencil) < 21 (Kepler)
• Remove high-latency operations
– e.g. divisions
• Get rid of branches
– avoid discontinuities in SIMD/SIMT
• Continuous memory access
– AOS to SOA
![Page 9: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/9.jpg)
GPU: General Strategies
• One thread One point – High parallelism
• One thread One line – on-chip data reuse
• Common tricks – best L1/smem configuration
– minimize PCIe communication
– right block size • warp size
• shared memory bank size
![Page 10: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/10.jpg)
CPU/MIC: General Strategies
• Task parallelism
– Inherent parallelism
• Data parallelism
– SIMD
• On-chip data reuse
– Cache blocking
• Offload and Native
![Page 11: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/11.jpg)
CPU/MIC: Additional techniques
• Prefetch
– memory to cache
• Schedule
– thread and iteration
• Cilk array notation
– replace compiler hints
• Cilk Plus
– replace OpenMP
![Page 12: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/12.jpg)
Outline
Stencil
Architectures
Optimization Techniques
Experimental results
![Page 13: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/13.jpg)
Experimental results
![Page 14: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/14.jpg)
Performance Efficiency
![Page 15: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/15.jpg)
Knights Corner vs. Kepler
• User-controlled SM > Compiler-controlled L1
• Programmabilty – CUDA vs. Knights Corner Instructions
– OpenACC vs. OpenMP/Cilk
• Performance/Effort (double precision) – Hard to evaluate
• Power Efficiency (whole workstation) – MIC: 0.2770 Gflops/watt (single precision)
– GPU: 0.4133 Gflops/watt (single precision)
![Page 16: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/16.jpg)
Future work
• Support Vector Machine
– data mining
• Many-core and Multi-core architectures
• Welcome for discussion
![Page 17: Accelerating Forward Modeling by CPU,GPU and MIC · Accelerating Forward Modeling by CPU,GPU and MIC You, Yang and Fu, Haohuan you-y12@mails.tsinghua.edu.cn Department of Computer](https://reader030.vdocuments.us/reader030/viewer/2022041302/5e134117398eba079648c642/html5/thumbnails/17.jpg)
Thank You
You, Yang and Fu, Haohuan
Department of Computer Science& Technology
Tsinghua University, Beijing, China