TRANSCRIPT
1. Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping
Chi-Keung (CK) Luk, Technology Pathfinding and Innovation, Software Solutions and Services Group, Intel
Sunpyo Hong, Electrical and Computer Engineering, Georgia Institute of Technology
Hyesoon Kim, College of Computing, School of Computer Science, Georgia Institute of Technology
MICRO’09
2. Heterogeneous Architectures
Heterogeneous architectures are increasingly popular:
- Intel Core2 + Nvidia's GPU
- IBM's Cell processor
- Platform used: NHM + Larrabee
3. Software Challenge
[Diagram: a CPU + GPU system — a quad-core CPU (Core-0 to Core-3) alongside a SIMD GPU.]
The Mapping Problem: map computations to the processing elements (PEs) to optimize an objective function, which could be:
- Performance
- Energy
- Performance / Energy
4. Existing Solutions to the Mapping Problem
The programmer performs the mapping manually and statically.
Examples:
- IBM XL compiler extension that supports OpenMP on the Cell
- Intel CTG's ExoCHI/Merge framework for programming the CPU and GPU
Disadvantages:
- Labor intensive
- Not adaptable to changes in runtime environments
5. Outline
- Introduction
- Case Study
- Adaptive Mapping
- Experimental Evaluation
- Conclusions
6. Case Study: Matrix Multiplication
Heterogeneous machine used:
- CPU: dual-socket quad-core (max = 8 cores)
- GPU: Nvidia GTX-8800
Three configurations tested:
1. Small problem size, max CPU cores used
2. Big problem size, max CPU cores used
3. Big problem size, fewer CPU cores used
In each configuration: perform cooperative matrix multiplication, varying the distribution of work over the CPU and GPU.
7. Cooperative Matrix Multiplication
C = A x B. Split A row-wise into A1 and A2: the CPU computes C1 = A1 x B while the GPU computes C2 = A2 x B, and C is the two row-blocks stacked.
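A minimal sketch of this cooperative split (plain Python for illustration, not Qilin's generated code): A's rows are divided by a fraction beta, each block is multiplied by B independently, and the partial results are stacked.

```python
# Illustrative sketch only: in Qilin the two blocks would run concurrently on
# the CPU (via TBB) and the GPU (via CUDA); here both run on the host.

def matmul(A, B):
    """Plain row-by-column multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def cooperative_matmul(A, B, beta):
    """Split A's rows: fraction beta to the 'CPU', the rest to the 'GPU'."""
    split = int(round(beta * len(A)))
    C1 = matmul(A[:split], B)   # CPU's share: C1 = A1 x B
    C2 = matmul(A[split:], B)   # GPU's share: C2 = A2 x B
    return C1 + C2              # C is simply the two row-blocks stacked

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 2], [3, 4]]
assert cooperative_matmul(A, B, 0.5) == matmul(A, B)
```

Because the split is by rows of A, any value of beta yields the same C; what varies is how long each processing element takes, which is exactly what the results on the next slide measure.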
8. Cooperative Matrix Multiplication Results
[Three bar charts of speedup over serial vs. the CPU/GPU work split (GPU-only, 90/10, 80/20, ..., 10/90, CPU-only):
- Configuration 1 (matrix dimension 1000, 8 CPU cores): peak speedup about 7.7x at an intermediate split.
- Configuration 2 (matrix dimension 6000, 8 CPU cores): peak speedup about 10.3x.
- Configuration 3 (matrix dimension 6000, 2 CPU cores): peak speedup about 9.3x.]
Lessons learned:
- The optimal PE mapping depends on the application, the input size, and the hardware resources available.
- We need an automatic and dynamic technique that takes all these factors into account.
Our contribution: ADAPTIVE MAPPING
9. Adaptive Mapping
A technique to automatically find the near-optimal mapping for the given program, problem size, and hardware.
Each <program, hardware> configuration involves one training run and many reference runs:
- Training run: find the execution-time projections of the CPU and the GPU for the given configuration.
- Reference run: compute the near-optimal distribution of work for the current problem size.
10. Training Run
[Diagram: kernel K is run on parts of the input on the CPU (sizes N1,1 ... N1,m, taking times TC(N1,1) ... TC(N1,m)) and on the GPU (sizes N2,1 ... N2,m, taking times TG(N2,1) ... TG(N2,m)). Curve fitting over these samples yields the projections T'C(N) and T'G(N), which are stored in a database.]
T'C(N) = the projected time to execute the kernel on a problem of size N on the CPU = ac + bc * N
T'G(N) = the projected time to execute the kernel on a problem of size N on the GPU = ag + bg * N
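The curve fitting can be sketched with ordinary least squares. This is a hedged example with invented timings; Qilin's actual training machinery, sampling strategy, and numbers differ.

```python
# Fit T'(N) = a + b*N to (input size, runtime) samples from a training run.
# All timing numbers below are made up for illustration.

def fit_line(sizes, times):
    """Least-squares fit: return (a, b) for T'(N) = a + b*N."""
    m = len(sizes)
    sx, sy = sum(sizes), sum(times)
    sxx = sum(n * n for n in sizes)
    sxy = sum(n * t for n, t in zip(sizes, times))
    b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    a = (sy - b * sx) / m
    return a, b

# Hypothetical measurements: TC(N1,j) on the CPU, TG(N2,j) on the GPU.
cpu_sizes, cpu_times = [1000, 2000, 4000], [0.9, 1.7, 3.3]
gpu_sizes, gpu_times = [1000, 2000, 4000], [0.6, 0.9, 1.5]
ac, bc = fit_line(cpu_sizes, cpu_times)   # here: ac = 0.1, bc = 0.0008
ag, bg = fit_line(gpu_sizes, gpu_times)   # here: ag = 0.3, bg = 0.0003

# Projections stored in the database and reused across reference runs:
def T_C(N): return ac + bc * N
def T_G(N): return ag + bg * N
```

Once fitted, projecting the runtime for an unseen problem size is a constant-time evaluation of the line, which is what makes the reference runs cheap.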
11. Reference Run
β = fraction of work mapped to the CPU; p = number of CPU cores; N = problem size.
T'β(N) = the projected time to execute βN work on the CPU and (1-β)N work on the GPU
       = Max( (p/(p-1)) T'C(βN), T'G((1-β)N) )
(The factor p/(p-1) reflects that one of the p CPU cores is used to drive the GPU.)
Once N is fixed to the actual problem size Nr, the fitted curves are looked up from the database and we find the β that minimizes T'β(Nr). We consider where the two curves (p/(p-1)) T'C(βNr) and T'G((1-β)Nr) intersect. There are 3 possible cases (see next slide).
12. Three Possible Cases of β
Plotting (p/(p-1)) T'C(βNr) (increasing in β) against T'G((1-β)Nr) (decreasing in β) over β in [0, 1]:
- Case i: the two curves intersect at β <= 0. T'β(Nr) is minimized by mapping all work to the GPU (β = 0).
- Case ii: the two curves intersect at β >= 1. T'β(Nr) is minimized by mapping all work to the CPU (β = 1).
- Case iii: the two curves intersect at some βmin with 0 < βmin < 1. T'β(Nr) is minimized by mapping the fraction βmin of the work to the CPU.
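Under linear projections, the minimizing β has a closed form: the max of an increasing and a decreasing line is smallest at their intersection, clamped to [0, 1] for cases i and ii. A hedged sketch (the coefficients are illustrative stand-ins, not Qilin's internal code):

```python
# Minimize T'_beta(Nr) = max( (p/(p-1)) * T'_C(beta*Nr), T'_G((1-beta)*Nr) )
# where T'_C(N) = ac + bc*N and T'_G(N) = ag + bg*N are the fitted projections.

def best_beta(ac, bc, ag, bg, p, Nr):
    scale = p / (p - 1)  # one CPU core is used to drive the GPU
    # Solve scale*(ac + bc*beta*Nr) = ag + bg*(1 - beta)*Nr for beta:
    beta = (ag + bg * Nr - scale * ac) / ((scale * bc + bg) * Nr)
    # Case i (beta <= 0): all work to the GPU; case ii (beta >= 1): all to the CPU.
    return min(1.0, max(0.0, beta))

# Case iii example with illustrative stand-in coefficients: the curves cross
# strictly between 0 and 1, so part of the work goes to each PE.
beta = best_beta(ac=0.1, bc=0.0008, ag=0.3, bg=0.0003, p=8, Nr=3000)
assert 0.0 < beta < 1.0
```

The clamping implements cases i and ii directly: an out-of-range intersection means one device is projected faster for the entire workload.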
13. Outline
- Introduction
- Case Study
- Adaptive Mapping
- Experimental Evaluation
- Conclusions
14. Prototype Implementation
Adaptive mapping could be implemented as:
- Off-line optimization for static compilation
- On-line optimization for dynamic compilation
Our prototype: a dynamic compilation system called Qilin.
Qilin API: both stream-based and thread-based.
Dynamic code generation:
- Generate TBB source code for the CPU
- Generate CUDA source code for the GPU
- Generate glue code to:
  - Copy data back and forth between the CPU and GPU
  - Stage computations onto the GPU to satisfy the GPU memory limitation
  - Divide work according to Adaptive Mapping
[Diagram: a C++ app calls the Qilin API; the Qilin system targets the CPU and GPU.]
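One piece of that glue, sketched under stated assumptions (hypothetical helper names in Python; Qilin actually generates CUDA/TBB code): when the GPU's share of the data exceeds device memory, the computation is staged in chunks that each fit on the device.

```python
# Hypothetical sketch of staging: process `data` in device-sized chunks.
# The slicing, call, and extend stand in for copy-to-GPU, kernel launch,
# and copy-back for each chunk.

def run_staged(data, max_items_on_device, kernel):
    out = []
    for start in range(0, len(data), max_items_on_device):
        chunk = data[start:start + max_items_on_device]  # host -> device copy
        out.extend(kernel(chunk))                        # kernel launch on chunk
    return out                                           # results copied back

# Stand-in "kernel": square each element.
result = run_staged(list(range(10)), 4, lambda c: [x * x for x in c])
assert result == [x * x for x in range(10)]
```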
15. Heterogeneous PC Used
                  CPU                        GPU
Architecture      Intel Core2 Quad           Nvidia 8800 GTX
Core Clock        2.4 GHz                    1.35 GHz
Number of Cores   8 cores (on 2 sockets)     128 stream processors
Memory Size       4 GB                       768 MB
Memory Bandwidth  8 GB/s                     86.4 GB/s
Threading API     Intel TBB                  Nvidia CUDA
Compiler          ICC 10.1                   NVCC 1.1
OS                32-bit Linux Fedora Core 6
16. Benchmarks
(Financial, image processing, scientific)
Name            Description                                                      Source
Binomial        American option pricing                                          CUDA SDK
BlackScholes    European option pricing                                          CUDA SDK
Convolve        2D separable image convolution                                   CUDA SDK
MatrixMultiply  Dense matrix multiplication                                      CUDA SDK
Linear          Linear image filter: output pixel is the average of a 9-pixel square   Intel's Merge
Sepia           Modify RGB values to artificially age images                     Merge
Smithwat        Compute the scoring matrix for a pair of DNA sequences           Merge
Svm             Kernel from an SVM-based face classifier                         Merge
17. Performance of Adaptive Mapping
[Bar chart, y-axis in logarithmic scale: speedup over serial for CPU-always, GPU-always, Manual mapping, and Adaptive mapping across Binomial, BlackScholes, Convolve, MatrixMultiply, Linear, Sepia, Smithwat, Svm, and their geometric mean. Geo-mean speedups: CPU-always 5.5x, GPU-always 7x, Manual mapping 9.9x, Adaptive mapping 9.3x.]
Adaptive mapping achieves 94% of the speedup of manual mapping.
18. Energy Consumption
[Bar chart: normalized energy consumption (%) for CPU-always, GPU-always, Manual mapping, and Adaptive mapping across the benchmarks and their geometric mean. Geo-mean normalized energy: CPU-always 100.0%, GPU-always 63.3%, Manual mapping 49.2%, Adaptive mapping 51.0%.]
Adaptive mapping is nearly as good as manual mapping in energy consumption.
(Total system power measured with an Extech 38080 Power Analyzer.)
19. Distribution of Computations
                Manual mapping     Adaptive mapping
                CPU      GPU       CPU      GPU
Binomial        10%      90%       10.5%    89.5%
BlackScholes    40%      60%       46.5%    53.5%
Convolve        40%      60%       36.3%    63.7%
MatrixMultiply  40%      60%       45.5%    54.5%
Linear          60%      40%       50.8%    49.2%
Sepia           80%      20%       76.2%    23.8%
Smithwat        60%      40%       59.3%    40.7%
Svm             10%      90%       14.3%    85.7%
Adaptive mapping and manual mapping have similar distributions.
20. Related Work
Hardware:
- Kumar et al. demonstrate the advantages of heterogeneous over homogeneous CMPs in terms of power and throughput.
- Similar observations from Hill and Marty.
=> Both studies point out the importance of the mapping problem.
Software:
- GPGPU: Brook, Accelerator, Peakstream, Rapidmind, Brook+, CUDA (all GPU-only).
- Intel's TBB and Ct (currently CPU-only).
- IBM's OpenMP extension for Cell and Intel's ExoCHI/Merge: use both the CPU and GPU, but based on static manual mapping.
- OpenCL: based on the initial specification, it doesn't seem to have any automatic mapping technique.
Autotuning:
- Generate many variants of a computation kernel and benchmark each variant on the target platform.
- Adaptive mapping can be regarded as an autotuning technique that tunes the distribution of work on heterogeneous platforms.
21. Conclusions
- Adaptive mapping automates the mapping from computations to heterogeneous multicores.
- Encouraging results:
  - Performance and energy consumption close to manual mapping.
  - Adapts to changes in input size, hardware, and software configurations (see our paper).
- Applicable to other heterogeneous systems, e.g., OpenCL or Ct on NHM + Larrabee.
- Future work: extend it to handle irregular computations.
- Adaptive mapping could be an important technique in the multicore software stack.
22. Acknowledgments
- Michael Linderman, Jamison Collins, and Hong Wang, for sharing their Merge benchmarks.
- Geoff Lowney and Mark Abel, for supporting this work.
- Geoff Lowney and Robert Cohn, for suggestions and feedback.
24. Impact of Training Input Size
[Bar chart, y-axis in logarithmic scale: speedup over serial for each benchmark and the geometric mean, with the training input size at 100%, 80%, 50%, 30%, 20%, and 10% of the reference input size. Geo-mean speedups: roughly 9.3x, 9.3x, 9.2x, 9x, 8.2x, and 7.5x, respectively.]
Most of the performance benefit of Adaptive Mapping is preserved when the training input size is at least 30% of the reference input size.
25. Adapting to Hardware Changes (1)
Using a less powerful GPU (GTX8800 with 128 cores => GTS8800 with 96 cores).
[Two bar charts, y-axes in logarithmic scale. Left (GTS8800): speedup over serial for CPU-always, GPU-always, and Adaptive mapping per benchmark and geo-mean; geo-mean speedups roughly 5.5x, 5.7x, and 8.2x. Right (original GTX8800 result): geo-mean speedups of 5.5x (CPU-always), 7x (GPU-always), 9.9x (Manual mapping), and 9.3x (Adaptive mapping).]
Adaptive mapping automatically recovers part of the performance lost in the GPU from the CPU.
26. Adapting to Hardware Changes (2)
Using a less powerful CPU (CPU with 8 cores => CPU with 2 cores).
[Two bar charts, y-axes in logarithmic scale. Left (2-core CPU): speedup over serial for CPU-always, GPU-always, and Adaptive mapping per benchmark and geo-mean; geo-mean speedups roughly 1.5x, 7x, and 7.2x. Right (original 8-core result): geo-mean speedups of 5.5x (CPU-always), 7x (GPU-always), 9.9x (Manual mapping), and 9.3x (Adaptive mapping).]
Adaptive mapping shifts most work to the GPU.
27. Adapting to Software Changes
Using a different compiler on the CPU (ICC => GCC, for both the serial and parallel cases). GCC doesn't use SSE-x as well as ICC does.
[Two bar charts, y-axes in logarithmic scale. Left (GCC): speedup over serial for CPU-always, GPU-always, and Adaptive mapping per benchmark and geo-mean; geo-mean speedups roughly 7.1x, 13.6x, and 16.1x, relative to the slower GCC serial baseline. Right (original ICC result): geo-mean speedups of 5.5x (CPU-always), 7x (GPU-always), 9.9x (Manual mapping), and 9.3x (Adaptive mapping).]
Adaptive mapping biases toward the GPU.