boyi: a systematic framework for automatically deciding ... · opencl feature1:swi kernel pipelined...
TRANSCRIPT
![Page 1: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/1.jpg)
Boyi: A Systematic Framework for Automatically Deciding the Right Execution
Model of OpenCL Applications on FPGAs
Jiantong Jiang (Northeastern University, China),Zeke Wang (Zhejiang University, China),Xue Liu (Northeastern University, China),
Juan Gómez-Luna (ETH Zürich, Switzerland),Nan Guan (Hong Kong Polytechnic University),Qingxu Deng (Northeastern University, China),
Wei Zhang (Hong Kong University of Science and Technology),Onur Mutlu (ETH Zürich , Switzerland)
![Page 2: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/2.jpg)
Outline
• Background and Motivations
• Our Solution
• Experiment
• Conclusion
![Page 3: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/3.jpg)
What is OpenCL?
• OpenCL stands for Open Computing Language.
• OpenCL has been developed for heterogeneous
computing environments with a host-accelerator
execution model.
ØThe CPU runs the control task.ØThe GPU/ FPGA runs the computing kernel.
![Page 4: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/4.jpg)
Global Memory Interconnect
Global Memory
...
...
CU-1Local Memory
Pipeline
...
Private Memory
...
CU-NLocal Memory
Pipeline
...
Private Memory
OpenCL on FPGA
External DDR
FPGAMemory blocks
• Software-centric à FPGA as a parallelarchitecture.
• Users can program with OpenCL.
DSP blocks
Memory blocks
Logic blocks
• Hardware-centric à fine-gainedparallelism
• Users need to program with HDL.
OpenCLSDK
![Page 5: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/5.jpg)
• How conventional OpenCL performs on an FPGA?
Core2Core1Core0
Conventional OpenCL: NDRange Kernel
Explicit multi-threaded à executing the same operation on multiple data concurrently
Load
Compute
store
Load
Compute
store
Load
Compute
store
Data 0 Data 1 Data 2
![Page 6: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/6.jpg)
• Conventional OpenCL cannot always represent
FPGA architecture in an efficient manner.
Issue of Conventional OpenCL
1.6x 12.1x 4.1x 3.0x 4.7x 11.3x 21.0x
![Page 7: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/7.jpg)
The optimal performance is enabled by two OpenCL features!
![Page 8: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/8.jpg)
• The SWI model executes the kernel in only one
CU that contains only one work-item.
Load
Compute
store
OpenCL Feature 1: SWI Kernel
Pipelined parallelismà No conflict among work items.
D0 D1 D2 D3 D7D6D5D4
8 Data Points
![Page 9: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/9.jpg)
OpenCL
channel
OpenCL Feature 2: OpenCL Channel
• OpenCL channel can be used to pass data
between two OpenCL kernels (typically SWI).
ØSynchronizing the kernelsØReducing the number of global memory accesses
Global memory
FPGA
Producer kernel
Consumer kernel
![Page 10: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/10.jpg)
• Two OpenCL features exponentially increase design space.• Enabled by two features, we have four OpenCL execution models:
• For each execution model, we have at least six optimizationmethods:
• For each optimization method, we have different pragmas.
Challenges
NDRange SWI NDRange+Channel SWI+Channel
SM MC PM UL SIMD CU
• The compilation time is extremely long!
![Page 11: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/11.jpg)
73.8
21.0 32.4 17.1
2.7
147.7
837.1
34.7
4.6
1.0 2.7 1.5
0.3
31.4
6.6
0.025 0.01
0.1
1
10
100
1000
RSCD TQH HSTO SC CEDD KM MM MS
Spee
dup
over
the
GPU
bas
elin
e
Most suitable execution model
Most unsuitable execution model
Effect of Four Execution Models
• Different execution models can significantly affect
the performance.
NDRange SWI NDRange+Channel SWI+Channel
Execution model should be decided first.
![Page 12: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/12.jpg)
Can we explicitly determine the most suitable
execution model (i.e., whether or not to use
two OpenCL features) to optimize OpenCL
programs on FPGAs?
We provide a systematic framework Boyi to
automatically determine the most suitable
execution model.
![Page 13: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/13.jpg)
Outline
• Background and Motivations
• Our Solution
Ø OpenCL Pattern RecognitionØ Execution Model Prediction
• Experiment
• Conclusion
![Page 14: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/14.jpg)
Architecture of Boyi• Boyi explicitly determines the most suitable execution
model to optimize OpenCL programs on FPGAs.• OpenCL Pattern Recognition• Execution Model Prediction
Clang
Frontend
LLVM
IR
Direct
Prediction
Potential
Evolution
Four Execution Models
NDR
SWI
NDR + C
SWI + C
SWI
OpenCL
Channel
OpenCL Pattern Recognition Execution Model Prediction
OpenCL Kernel
Source Code
Host C/C++
Source Code
Most Suitable
Execution ModelKKC Recognition
MPS Recognition
AO Recognition
AO: Atomic Operation
MPS: Multi-Pass Scheme
KKC: Kernel-to-Kernel Communication
![Page 15: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/15.jpg)
Outline
• Background and Motivations
• Our Solution
Ø OpenCL Pattern RecognitionØ Execution Model Prediction
• Experiment
• Conclusion
![Page 16: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/16.jpg)
5
• Issues on FPGAs:
OpenCL Pattern: Atomic OperationInput data
Hash function
Histogram
0
1
2
3
10
61
42
73
24
55
96
07
Hash index
12032110
8 work-itemsConflict
Noticeable resource overheadLong latency and low bandwidthLow frequency à AO is not a good fit on FPGAs.
1
6
4
7
2
9
0
![Page 17: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/17.jpg)
5
OpenCL Pattern: Atomic Operation
• Potential on FPGAs:Input data
Hash function
Histogram
0
1
2
3
10
61
42
73
24
55
96
07
Hash index
12032110
1
6
4
7
2
9
0
![Page 18: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/18.jpg)
OpenCL Pattern: Multi-Pass Scheme
in 6 0 2 31 4 3 52 8 9 01 6 4 7
0 6 6 80 1 5 80 2 10 190 1 7 11out
11131918local_sum
Step 1: 4 work-groups
5037180pre_sum
50 56 56 5837 38 42 4518 20 28 370 1 7 11out
Step 2: 1 work-group
in 6 0 2 31 4 3 52 8 9 01 6 4 7
0 6 6 80 1 5 80 2 10 190 1 7 11out
Step 3: 4 work-groups
11131918local_sum
![Page 19: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/19.jpg)
• Issues on FPGAs:
• Potential on FPGAs:
OpenCL Pattern: Multi-Pass Scheme
6 0 2 31 4 3 52 8 9 01 6 4 7
50 56 56 5837 38 42 4518 20 28 370 1 7 11out
in
More memory traffic
à MPS is not a good fit on FPGAs.
![Page 20: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/20.jpg)
OpenCL Pattern:
Kernel-to-Kernel Communication
• Issues on FPGAs:
The communication via global memory is expensive.
Global memory
FPGA
Producer kernel
Consumer kernel
![Page 21: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/21.jpg)
• Potential on FPGAsReducing the number of memory accessesInter-kernel parallelism (i.e., concurrent kernel execution)
OpenCL
channel
Global memory
FPGA
Producer kernel
Consumer kernel
OpenCL Pattern:
Kernel-to-Kernel Communication
![Page 22: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/22.jpg)
#Kernels >1?
#Kernels >1? #KKCTrue = 0
R1: NumKernels
R2: IsSameBuff
#R2Triplets > 0?
#Kernels
Y
N
#R2Triplets
#KKCTrue
Pass for host C/C++ analysis
Pass for OpenCL kernel analysis
Y
#MPSTrue = 0
#Kernels
Y
N
#R4Triplets
#MPSTrue = 0 N
Y
Buffs
#MPSTrue
R2BuffTriplets
#KKCTrue = 0 N
#R2Triplets
Y
R2BuffTriplets
#MPSTrue = 0 N
R4SeqTriplets
R1: NumOfKernels
R2: IsSameBuff
R2: IsRdWr
R2: IsRdWr
R3: IsSameMAP
R4: IsSequential
R5: VarBuffInHost
R5: BuffInKernel
#AOTrue
(c) MPS recognition
(b) KKC recognition
(a) AO recognition
#R2Triplets > 0?
#R4Triplets > 0?
HasAO
R5: VarInKernel
Vars, VarVals
Args, ArgVals
•We develop nine LLVM passes
to recognize three OpenCL
patterns.
• AO recognition• KKC recognition• MPS recognition
OpenCL Pattern Recognition
The implementation details can be found in our paper!
![Page 23: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/23.jpg)
Outline
• Background and Motivations
• Our Solution
Ø OpenCL Pattern RecognitionØ Execution Model Prediction
• Experiment
• Conclusion
![Page 24: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/24.jpg)
Execution Model Prediction
• Direct prediction
Kernels with AO and MPS benefit from the SWI kernel.Kernels with KKC benefit from the OpenCL channel.
AO MPS KKCN N NY N NN Y NY Y NN N YY N YN Y YY Y Y
Direct predictionNDR
SWI
SWI
SWI
NDR+C
SWI+C
SWI+C
SWI+C
![Page 25: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/25.jpg)
Execution Model Prediction
• Potential evolution of SWI
ØConditions:
AO MPS KKC Direct predictionN N N NDR
Y N N SWI
N Y N SWI
Y Y N SWI
N N Y NDR+C
Y N Y SWI+C
N Y Y SWI+C
Y Y Y SWI+C
Potential evolutionNDR
SWI+C
SWI+C
SWI+C
NDR+C
SWI+C
SWI+C
SWI+C
Sufficient FPGA resource.The SWI kernel is compute-bound.
![Page 26: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/26.jpg)
Outline
• Background and Motivations
• Our Solution
• Experiment
ØExperimental SetupØEffect of Execution ModelØPrediction of Execution Model
• Conclusion
![Page 27: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/27.jpg)
• Platform: Terasic DE5a-Net board: Altera Arria 10 GX FPGA and 8GB 2-bank DDR3, with Altera OpenCL SDK version 16.1.• Workloads:
Experimental Setup
Benchmark Source Description AO MPS KKC DatasetsBFS Chai Breadth-First Search Y N N NY, NE, UT
RSCD RANSAC Y N Y 2000 iterations
TQH Task Queue System Y N N Basket
HSTO Histogram Y N N 256bins
SC Stream Compaction Y N N 50%
PAD Padding Y N N 1000*999
CEDD Canny Edge Detection N N Y Peppa, Maradona, Paw
KM Rodinia K-Means N N N 25600 points, 8 features
MM Intel demo Matrix Multiplication N N N A: 2k*1k, B: 1k*1k
MS Mandelbrot Set N N N 640*800, 2000 iterations
PS CUDA demo Prefix Sum N Y N 262144 points
![Page 28: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/28.jpg)
Comparison Methodology
• Hypothesis 1: Different execution models lead to
significant performance differences.
ØQuantitative comparison among execution modelsØExploring optimization combinations
• Hypothesis 2: Boyi can predict the most suitable
execution model for each OpenCL application.
![Page 29: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/29.jpg)
App Number of combinations Maximum speedupNDR SWI NDR+C SWI+C NDR SWI NDR+C SWI+C
BFS 17 7 7 9 1.9 3.1 1.2 3.1RSCD 25 10 24 46 15.8 4.6 73.8 39.7
TQH 9 15 23 1.1 1.3 21.0HSTO 13 37 11 29 2.7 5.1 16.9 32.4
SC 15 34 10 1.5 4.5 17.1PAD 10 10 14 1.2 1.6 4.8
CEDD 57 15 22 7 2.7 0.3 2.7 0.4
KM 33 11 10 18 147.7 32.8 136.4 31.4
MM 25 9 6 837.1 13.3 6.6
MS 7 6 7 34.7 0.02 3.2
PS 26 20 12 15.8 44.4 46.2
Hypothesis 1: Different executionmodel --> different performance
• Quantitative comparison among execution models
Different execution models result in significant performance differences.
Different applications require different execution models to achieve the best performance.
It is critical to decide the most suitable execution model when optimizing OpenCL applications on FPGAs.
![Page 30: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/30.jpg)
2.3 9.0
2.0 3.9 5.1
51.2
73.5
7.7
28.8
51.3
114.9
89.3
147.7
122.0
2.3 2.5 2.7 0.1 0.1 4.9 8.8 16.6
32.8
1.2 14.7
124.9 136.4
2.2 14.4
90.3
119.0 112.6
31.4
1.9 9.9
17.5 29.0
0
20
40
60
80
100
120
140
160
Spee
dup
over
the
GPU
bas
elin
e NDR SWI+CNDR+CSWI
•We manually implement sufficient number of
optimization combinations (subset) for KM, such
that we reach the near-to-optimal optimization
combination for each execution model.
Most suitable execution model Most unsuitable execution model
Hypothesis 1: Exploring optimization combinations for Each Execution Model
![Page 31: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/31.jpg)
Application AO MPS KKC Actual PredictedBFS Y N N SWI SWI
RSCD Y N Y NDR+C SWI+C
TQH Y N N SWI+C SWI+C ※
HSTI Y N N SWI+C SWI+C ※
SC Y N N SWI+C SWI+C ※
PAD Y N N SWI+C SWI+C ※
CEDD N N Y NDR+C NDR+C
KM N N N NDR NDR
MM N N N NDR NDR
MS N N N NDR NDR
PS N Y N SWI SWI
SWI+C ※ indicates the potential evolution of SWI
N NDR+C
The actual and predicted execution models roughly match.
Hypothesis 2: Boyi Predicts the Right Execution Model
![Page 32: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/32.jpg)
End-to-end Performance Comparison
• Performance comparison to existing works
Application Ours (ms) Existing work (ms) Our/Existing SpeedupRSCD [1] 0.8 28.9 38.3TQH [1] 66.9 150.6 2.3HSTO [1] 38.8 487.9 12.6CEDD [1] 161.9 237.8 1.5MM [2] 9.1 34.3 3.8MS [2] 27.2 944.1 34.7
[1] S. Huang et al., “Analysis and modeling of collaborative execution strategies for heterogeneouscpu-fpga architectures”, ICPE, 2019.[2] Intel. Intel SDK for OpenCL Design Examples. 2018
![Page 33: Boyi: A Systematic Framework for Automatically Deciding ... · OpenCL Feature1:SWI Kernel Pipelined parallelismàNo conflict among work items. D0 D1 D2 D3 D4 D5 D6 D7 8 Data Points](https://reader034.vdocuments.us/reader034/viewer/2022042709/5f5426c5ba69414a6514c7b5/html5/thumbnails/33.jpg)
Outline
• Background and Motivations
• Our Solution
• Experiment
• Conclusion