TRANSCRIPT
1. Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping
Chi-Keung (CK) Luk, Technology Pathfinding and Innovation, Software Solutions and Services Group, Intel
Sunpyo Hong, Electrical and Computer Engineering, Georgia Institute of Technology
Hyesoon Kim, College of Computing, School of Computer Science, Georgia Institute of Technology
MICRO’09
2. Heterogeneous Architectures
Heterogeneous architectures are increasingly popular:
- Intel Core2 + Nvidia's GPU
- IBM's Cell processor
- Platform used: NHM + Larrabee
3. Software Challenge
[Diagram: a CPU + GPU system — a quad-core CPU (Core-0 to Core-3) alongside a SIMD GPU.]
The Mapping Problem: map computations to the processing elements (PEs) to optimize an objective function, which could be:
- Performance
- Energy
- Performance / Energy
4. Existing Solutions to the Mapping Problem
The programmer performs the mapping manually and statically.
Examples:
- IBM XL compiler extension that supports OpenMP on the Cell
- Intel CTG's ExoCHI/Merge framework for programming the CPU and GPU
Disadvantages:
- Labor intensive
- Not adaptable to changes in runtime environments
5. Outline
- Introduction
- Case Study
- Adaptive Mapping
- Experimental Evaluation
- Conclusions
6. Case Study: Matrix Multiplication
Heterogeneous machine used:
- CPU: dual-socket quad-core (max = 8 cores)
- GPU: Nvidia GTX-8800
Three configurations tested:
1. Small problem size, max CPU cores used
2. Big problem size, max CPU cores used
3. Big problem size, fewer CPU cores used
In each configuration: perform cooperative matrix multiplication, varying the distribution of work over the CPU and GPU.
7. Cooperative Matrix Multiplication
C = A x B. Split A row-wise into A1 and A2: the CPU computes C1 = A1 x B while the GPU computes C2 = A2 x B, and C is the two row-blocks stacked.
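A minimal sketch of this cooperative split (plain Python for illustration, not Qilin's generated code): A's rows are divided by a fraction beta, each block is multiplied by B independently, and the partial results are stacked.

```python
# Illustrative sketch only: in Qilin the two blocks would run concurrently on
# the CPU (via TBB) and the GPU (via CUDA); here both run on the host.

def matmul(A, B):
    """Plain row-by-column multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def cooperative_matmul(A, B, beta):
    """Split A's rows: fraction beta to the 'CPU', the rest to the 'GPU'."""
    split = int(round(beta * len(A)))
    C1 = matmul(A[:split], B)   # CPU's share: C1 = A1 x B
    C2 = matmul(A[split:], B)   # GPU's share: C2 = A2 x B
    return C1 + C2              # C is simply the two row-blocks stacked

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 2], [3, 4]]
assert cooperative_matmul(A, B, 0.5) == matmul(A, B)
```

Because the split is by rows of A, any value of beta yields the same C; what varies is how long each processing element takes, which is exactly what the results on the next slide measure.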
8. Cooperative Matrix Multiplication Results
[Three bar charts of speedup over serial vs. the CPU/GPU work split (GPU-only, 90/10, 80/20, ..., 10/90, CPU-only):
- Configuration 1 (matrix dimension 1000, 8 CPU cores): peak speedup about 7.7x at an intermediate split.
- Configuration 2 (matrix dimension 6000, 8 CPU cores): peak speedup about 10.3x.
- Configuration 3 (matrix dimension 6000, 2 CPU cores): peak speedup about 9.3x.]
Lessons learned:
- The optimal PE mapping depends on the application, the input size, and the hardware resources available.
- We need an automatic and dynamic technique that takes all these factors into account.
Our contribution: ADAPTIVE MAPPING
9. Adaptive Mapping
A technique to automatically find the near-optimal mapping for the given program, problem size, and hardware.
Each <program, hardware> configuration involves one training run and many reference runs:
- Training run: find the execution-time projections of the CPU and the GPU for the given configuration.
- Reference run: compute the near-optimal distribution of work for the current problem size.
10. Training Run
[Diagram: kernel K is run on parts of the input on the CPU (sizes N1,1 ... N1,m, taking times TC(N1,1) ... TC(N1,m)) and on the GPU (sizes N2,1 ... N2,m, taking times TG(N2,1) ... TG(N2,m)). Curve fitting over these samples yields the projections T'C(N) and T'G(N), which are stored in a database.]
T'C(N) = the projected time to execute the kernel on a problem of size N on the CPU = ac + bc * N
T'G(N) = the projected time to execute the kernel on a problem of size N on the GPU = ag + bg * N
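The curve fitting can be sketched with ordinary least squares. This is a hedged example with invented timings; Qilin's actual training machinery, sampling strategy, and numbers differ.

```python
# Fit T'(N) = a + b*N to (input size, runtime) samples from a training run.
# All timing numbers below are made up for illustration.

def fit_line(sizes, times):
    """Least-squares fit: return (a, b) for T'(N) = a + b*N."""
    m = len(sizes)
    sx, sy = sum(sizes), sum(times)
    sxx = sum(n * n for n in sizes)
    sxy = sum(n * t for n, t in zip(sizes, times))
    b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    a = (sy - b * sx) / m
    return a, b

# Hypothetical measurements: TC(N1,j) on the CPU, TG(N2,j) on the GPU.
cpu_sizes, cpu_times = [1000, 2000, 4000], [0.9, 1.7, 3.3]
gpu_sizes, gpu_times = [1000, 2000, 4000], [0.6, 0.9, 1.5]
ac, bc = fit_line(cpu_sizes, cpu_times)   # here: ac = 0.1, bc = 0.0008
ag, bg = fit_line(gpu_sizes, gpu_times)   # here: ag = 0.3, bg = 0.0003

# Projections stored in the database and reused across reference runs:
def T_C(N): return ac + bc * N
def T_G(N): return ag + bg * N
```

Once fitted, projecting the runtime for an unseen problem size is a constant-time evaluation of the line, which is what makes the reference runs cheap.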
11. Reference Run
β = fraction of work mapped to the CPU; p = number of CPU cores; N = problem size.
T'β(N) = the projected time to execute βN work on the CPU and (1-β)N work on the GPU
       = Max( (p/(p-1)) T'C(βN), T'G((1-β)N) )
(The factor p/(p-1) reflects that one of the p CPU cores is used to drive the GPU.)
Once N is fixed to the actual problem size Nr, the fitted curves are looked up from the database and we find the β that minimizes T'β(Nr). We consider where the two curves (p/(p-1)) T'C(βNr) and T'G((1-β)Nr) intersect. There are 3 possible cases (see next slide).
12. Three Possible Cases of β
Plotting (p/(p-1)) T'C(βNr) (increasing in β) against T'G((1-β)Nr) (decreasing in β) over β in [0, 1]:
- Case i: the two curves intersect at β <= 0. T'β(Nr) is minimized by mapping all work to the GPU (β = 0).
- Case ii: the two curves intersect at β >= 1. T'β(Nr) is minimized by mapping all work to the CPU (β = 1).
- Case iii: the two curves intersect at some βmin with 0 < βmin < 1. T'β(Nr) is minimized by mapping the fraction βmin of the work to the CPU.
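Under linear projections, the minimizing β has a closed form: the max of an increasing and a decreasing line is smallest at their intersection, clamped to [0, 1] for cases i and ii. A hedged sketch (the coefficients are illustrative stand-ins, not Qilin's internal code):

```python
# Minimize T'_beta(Nr) = max( (p/(p-1)) * T'_C(beta*Nr), T'_G((1-beta)*Nr) )
# where T'_C(N) = ac + bc*N and T'_G(N) = ag + bg*N are the fitted projections.

def best_beta(ac, bc, ag, bg, p, Nr):
    scale = p / (p - 1)  # one CPU core is used to drive the GPU
    # Solve scale*(ac + bc*beta*Nr) = ag + bg*(1 - beta)*Nr for beta:
    beta = (ag + bg * Nr - scale * ac) / ((scale * bc + bg) * Nr)
    # Case i (beta <= 0): all work to the GPU; case ii (beta >= 1): all to the CPU.
    return min(1.0, max(0.0, beta))

# Case iii example with illustrative stand-in coefficients: the curves cross
# strictly between 0 and 1, so part of the work goes to each PE.
beta = best_beta(ac=0.1, bc=0.0008, ag=0.3, bg=0.0003, p=8, Nr=3000)
assert 0.0 < beta < 1.0
```

The clamping implements cases i and ii directly: an out-of-range intersection means one device is projected faster for the entire workload.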
13. Outline
- Introduction
- Case Study
- Adaptive Mapping
- Experimental Evaluation
- Conclusions
14. Prototype Implementation
Adaptive mapping could be implemented as:
- Off-line optimization for static compilation
- On-line optimization for dynamic compilation
Our prototype: a dynamic compilation system called Qilin.
Qilin API: both stream-based and thread-based.
Dynamic code generation:
- Generate TBB source code for the CPU
- Generate CUDA source code for the GPU
- Generate glue code to:
  - Copy data back and forth between the CPU and GPU
  - Stage computations onto the GPU to satisfy the GPU memory limitation
  - Divide work according to Adaptive Mapping
[Diagram: a C++ app calls the Qilin API; the Qilin system targets the CPU and GPU.]
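One piece of that glue, sketched under stated assumptions (hypothetical helper names in Python; Qilin actually generates CUDA/TBB code): when the GPU's share of the data exceeds device memory, the computation is staged in chunks that each fit on the device.

```python
# Hypothetical sketch of staging: process `data` in device-sized chunks.
# The slicing, call, and extend stand in for copy-to-GPU, kernel launch,
# and copy-back for each chunk.

def run_staged(data, max_items_on_device, kernel):
    out = []
    for start in range(0, len(data), max_items_on_device):
        chunk = data[start:start + max_items_on_device]  # host -> device copy
        out.extend(kernel(chunk))                        # kernel launch on chunk
    return out                                           # results copied back

# Stand-in "kernel": square each element.
result = run_staged(list(range(10)), 4, lambda c: [x * x for x in c])
assert result == [x * x for x in range(10)]
```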
15. Heterogeneous PC Used
                  CPU                        GPU
Architecture      Intel Core2 Quad           Nvidia 8800 GTX
Core Clock        2.4 GHz                    1.35 GHz
Number of Cores   8 cores (on 2 sockets)     128 stream processors
Memory Size       4 GB                       768 MB
Memory Bandwidth  8 GB/s                     86.4 GB/s
Threading API     Intel TBB                  Nvidia CUDA
Compiler          ICC 10.1                   NVCC 1.1
OS                32-bit Linux Fedora Core 6
16. Benchmarks
(Financial, image processing, scientific)
Name            Description                                                      Source
Binomial        American option pricing                                          CUDA SDK
BlackScholes    European option pricing                                          CUDA SDK
Convolve        2D separable image convolution                                   CUDA SDK
MatrixMultiply  Dense matrix multiplication                                      CUDA SDK
Linear          Linear image filter: output pixel is the average of a 9-pixel square   Intel's Merge
Sepia           Modify RGB values to artificially age images                     Merge
Smithwat        Compute the scoring matrix for a pair of DNA sequences           Merge
Svm             Kernel from an SVM-based face classifier                         Merge
17. Performance of Adaptive Mapping
[Bar chart, y-axis in logarithmic scale: speedup over serial for CPU-always, GPU-always, Manual mapping, and Adaptive mapping across Binomial, BlackScholes, Convolve, MatrixMultiply, Linear, Sepia, Smithwat, Svm, and their geometric mean. Geo-mean speedups: CPU-always 5.5x, GPU-always 7x, Manual mapping 9.9x, Adaptive mapping 9.3x.]
Adaptive mapping achieves 94% of the speedup of manual mapping.
18. Energy Consumption
[Bar chart: normalized energy consumption (%) for CPU-always, GPU-always, Manual mapping, and Adaptive mapping across the benchmarks and their geometric mean. Geo-mean normalized energy: CPU-always 100.0%, GPU-always 63.3%, Manual mapping 49.2%, Adaptive mapping 51.0%.]
Adaptive mapping is nearly as good as manual mapping in energy consumption.
(Total system power measured with an Extech 38080 Power Analyzer.)
19. Distribution of Computations
                Manual mapping     Adaptive mapping
                CPU      GPU       CPU      GPU
Binomial        10%      90%       10.5%    89.5%
BlackScholes    40%      60%       46.5%    53.5%
Convolve        40%      60%       36.3%    63.7%
MatrixMultiply  40%      60%       45.5%    54.5%
Linear          60%      40%       50.8%    49.2%
Sepia           80%      20%       76.2%    23.8%
Smithwat        60%      40%       59.3%    40.7%
Svm             10%      90%       14.3%    85.7%
Adaptive mapping and manual mapping have similar distributions.
20. Related Work
Hardware:
- Kumar et al. demonstrate the advantages of heterogeneous over homogeneous CMPs in terms of power and throughput.
- Similar observations from Hill and Marty.
=> Both studies point out the importance of the mapping problem.
Software:
- GPGPU: Brook, Accelerator, Peakstream, Rapidmind, Brook+, CUDA (all GPU-only).
- Intel's TBB and Ct (currently CPU-only).
- IBM's OpenMP extension for Cell and Intel's ExoCHI/Merge: use both the CPU and GPU, but based on static manual mapping.
- OpenCL: based on the initial specification, it doesn't seem to have any automatic mapping technique.
Autotuning:
- Generate many variants of a computation kernel and benchmark each variant on the target platform.
- Adaptive mapping can be regarded as an autotuning technique that tunes the distribution of work on heterogeneous platforms.
21. Conclusions
- Adaptive mapping automates the mapping from computations to heterogeneous multicores.
- Encouraging results:
  - Performance and energy consumption close to manual mapping.
  - Adapts to changes in input size, hardware, and software configurations (see our paper).
- Applicable to other heterogeneous systems, e.g., OpenCL or Ct on NHM + Larrabee.
- Future work: extend it to handle irregular computations.
- Adaptive mapping could be an important technique in the multicore software stack.
22. Acknowledgments
- Michael Linderman, Jamison Collins, and Hong Wang, for sharing their Merge benchmarks.
- Geoff Lowney and Mark Abel, for supporting this work.
- Geoff Lowney and Robert Cohn, for suggestions and feedback.
24. Impact of Training Input Size
[Bar chart, y-axis in logarithmic scale: speedup over serial for each benchmark and the geometric mean, with the training input size at 100%, 80%, 50%, 30%, 20%, and 10% of the reference input size. Geo-mean speedups: roughly 9.3x, 9.3x, 9.2x, 9x, 8.2x, and 7.5x, respectively.]
Most of the performance benefit of Adaptive Mapping is preserved when the training input size is at least 30% of the reference input size.
25. Adapting to Hardware Changes (1)
Using a less powerful GPU (GTX8800 with 128 cores => GTS8800 with 96 cores).
[Two bar charts, y-axes in logarithmic scale. Left (GTS8800): speedup over serial for CPU-always, GPU-always, and Adaptive mapping per benchmark and geo-mean; geo-mean speedups roughly 5.5x, 5.7x, and 8.2x. Right (original GTX8800 result): geo-mean speedups of 5.5x (CPU-always), 7x (GPU-always), 9.9x (Manual mapping), and 9.3x (Adaptive mapping).]
Adaptive mapping automatically recovers part of the performance lost in the GPU from the CPU.
26. Adapting to Hardware Changes (2)
Using a less powerful CPU (CPU with 8 cores => CPU with 2 cores).
[Two bar charts, y-axes in logarithmic scale. Left (2-core CPU): speedup over serial for CPU-always, GPU-always, and Adaptive mapping per benchmark and geo-mean; geo-mean speedups roughly 1.5x, 7x, and 7.2x. Right (original 8-core result): geo-mean speedups of 5.5x (CPU-always), 7x (GPU-always), 9.9x (Manual mapping), and 9.3x (Adaptive mapping).]
Adaptive mapping shifts most work to the GPU.
27. Adapting to Software Changes
Using a different compiler on the CPU (ICC => GCC, for both the serial and parallel cases). GCC doesn't use SSE-x as well as ICC does.
[Two bar charts, y-axes in logarithmic scale. Left (GCC): speedup over serial for CPU-always, GPU-always, and Adaptive mapping per benchmark and geo-mean; geo-mean speedups roughly 7.1x, 13.6x, and 16.1x, relative to the slower GCC serial baseline. Right (original ICC result): geo-mean speedups of 5.5x (CPU-always), 7x (GPU-always), 9.9x (Manual mapping), and 9.3x (Adaptive mapping).]
Adaptive mapping biases toward the GPU.