sage: self-tuning approximation for graphics engines

SAGE: Self-Tuning Approximation for Graphics

Engines

Mehrzad Samadi1, Janghaeng Lee1, D. Anoushe Jamshidi1, Amir Hormati2, and Scott Mahlke1

University of Michigan1, Google Inc.2

December 2013

Compilers creating custom processorsUniversity of MichiganElectrical Engineering and Computer Science

2

Approximate Computing

• Different domains:– Machine Learning– Image Processing– Video Processing– Physical Simulation– …

Higher performanceLower power consumption

Less work

Quality: 100% 95%

90% 85%

3

Ubiquitous Graphics Processing Units

Super Computers Desktops Cell Phones

• Wide range of devices

• Mostly regular applications• Works on large data sets

Servers

Good opportunity for automatic approximation

4

SAGE Framework

• Simplify or skip processing• Computationally expensive• Lowest impact on the output quality

Output Error

Speedup

10%

10%

2~3x

• Self-Tuning Approximation on Graphics Engines• Write the program once• Automatic approximation• Self-tuning dynamically

5

Overview

6

SAGE FrameworkInput Program

ApproximationMethods

Approximate Kernels

Tuning Parameters

Static Compiler

Runtime system

7

Static Compilation

Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

8

Runtime System

Preprocessing

Tuning Execution Calibration

CPU

GPU

Time

Quality

Speedup

TOQ

Tuning T T + C T + 2C

Tuning

Preprocessing

Calibration

9

Runtime System

Preprocessing

Tuning Execution Calibration

CPU

GPU

Time

Quality

Speedup

TOQ

Tuning T T + C T

10

Approximation Methods

11


Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

12

Atomic Operations

// Compute histogram of colors in an image __global__ void histogram(int n, int* color, int* bucket) int tid = threadIdx.x + blockDim.x * blockIdx.x; int nThreads = gridDim.x * blockDim.x; for ( int i = tid ; tid < n; tid += nThreads) int c = colors[i]; atomicAdd(&bucket[c], 1);

• Atomic operations update a memory location such that the update appears to happen atomically

13

Atomic Operations

Threads

14

Atomic Operations

Threads

15

Atomic Operations

Threads

16

Atomic Operations

Threads

17

Atomic Add

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 310

2

4

6

8

10

12

14

Conflicts per Warp

Slowdown

18

Atomic Operation Tuning

Iterations

T0 T32 T64 T96 T128 T160

// Compute histogram of colors in an image __global__ void histogram(int n, int* color, int* bucket) int tid = threadIdx.x + blockDim.x * blockIdx.x; int nThreads = gridDim.x * blockDim.x; for ( int i = tid ; tid < n; tid += nThreads) int c = colors[i]; atomicAdd(&bucket[c], 1);

19


Iterations

T0 T32 T64 T96 T128 T160

• SAGE skips one iteration per thread• To improve the performance, it drops the iteration with the maximum number of conflicts

20


T0 T32 T64 T96 T128 T160

It drops 50% of iterations• SAGE skips one iteration per thread• To improve the performance, it drops the iteration with the maximum number of conflicts

21


T0 T32 T64

Drop rate goes down to 25%

22

Dropping One Iteration

0Conflicts

2

8

17

12

Iteration No.

1

2

3

After optimizationCD 0

0

Conflict Detection

1

3

CD 1

CD 2

CD 3

Max conflictIteration 0Iteration 1Iteration 2

23


Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

24

Data PackingThreads

Memory

25

Data PackingThreads

Memory

26

Data PackingThreads

Memory

27

Data PackingThreads

Memory

28

Quantization• Preprocessing finds min and max of the input

sets and packs the data

• During execution, each thread unpacks the data and transforms the quantization level to data by applying a linear transformation

Min Max

0 1 2 3 4 5 6 7

101

Quantization Levels

29

Quantization• Preprocessing finds max and min of the input

sets and packs the data

• During execution, each thread unpacks the data and transforms the quantization level to data by applying a linear transformation

Min Max

0 1 2 3 4 5 6 7

101

30


Evaluation Metric

Target Output

Quality (TOQ)

CUDA Code

Atomic Operation

Data Packing

ThreadFusion

Approximate Kernels

Tuning Parameters

31

Thread Fusion

Threads

Memory

Memory ≈ ≈ ≈ ≈

32

Thread Fusion

Threads

Memory

Memory

33

Thread Fusion

T0

Computation

Output writing

T1

ComputationOutput writing

34

Thread Fusion

T0 T1 T254 T255 T0 T1 T254 T255

Block 0 Block 1

ComputationOutput writing

35

Thread Fusion

T0 T127

Block 0 Block 1

T0 T127

Reducing the number of threads per block results in poor resource utilization

36

Block Fusion

T0 T254

Block 0 & 1 fused

T255T1

37

Runtime

38

How to Compute Output Quality?

Accurate Version

ApproximateVersion

EvaluationMetric

• High overhead• Tuning should find a good enough

approximate version as fast as possible• Greedy algorithm

39

Tuning

K(0,0) Quality = 100%Speedup = 1x

TOQ = 90%

K(x,y)

Tuning parameter of the First optimization

Tuning parameter of the Second optimization

40

Tuning


K(1,0) K(0,1)94%1.15X

96%1.5X

TOQ = 90%

41

Tuning


K(1,0) K(0,1)

K(1,1) K(0,2)

94%1.15X

96%1.5X

94%2.5X

95%1.6X

TOQ = 90%

42

Tuning


K(1,0) K(0,1)

K(1,1) K(0,2)

K(2,1) K(1,2)

94%1.15X

96%1.5X

94%2.5X

95%1.6X

88%3X

87%2.8X

Final ChoiceTuning Path

TOQ = 90%

43

Evaluation

44

Experimental Setup

• Backend of Cetus compiler

• GPU– NVIDIA GTX 560

• 2GB GDDR 5

• CPU– Intel Core i7

• Benchmarks– Image processing– Machine Learning

45

K-Means

80

90

100

1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96 1011061111160

1

2

Output Quality

Accumulative Speedup

TOQ

46

Confidence

0 50 100 150 200 250 300 350 40040

50

60

70

80

90

100

Calibration points

93

CI = 99%

CI = 95%

CI = 90%

After checking 50 samples, we will be 93% confident that 95% of the outputs satisfy the TOQ threshold𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟 ∝𝑝𝑟𝑖𝑜𝑟× h𝑙𝑖𝑘𝑒𝑙𝑖 𝑜𝑜𝑑

Beta Uniform Binomial

47

Calibration Overhead

0 20 40 60 80 100 120 140 160 180 2000

5

10

15

20

25

Calibration Interval

Gaussian

K-Means

Calibration Overhead(%)

48

K-Means

NB

Histogram

SVM

Fuzzy

MeanShift

Binarization

Dynamic

Mean Filter

Gaussian

1 2 3 4

2.0

1.8

3.6

1.4

1.7

2.3

2.2

1.3

3.3

1.7

1.7

1.3

1.3

1.8

1.5

1.6

Speedup 1 2 3 4

2.5

2.2

6.4

1.4

1.8

2.7

2.2

1.3

3.3

3.1

3.1

1.4

1.6

3.5

1.3

1.3

1.5

1.4

1.3

Speedup

Performance

6.4

SAGE SAGE

TOQ = 95% TOQ = 90%Loop Perforation Loop Perforation

GeometricMean

49

Conclusion• Automatic approximation is possible

• SAGE automatically generates approximate kernels with different parameters

• Runtime system uses tuning parameters to control the output quality during execution

• 2.5x speedup with less than 10% quality loss compared to the accurate execution

SAGE: Self-Tuning Approximation for Graphics

Engines

Mehrzad Samadi1, Janghaeng Lee1, D. Anoushe Jamshidi1, Amir Hormati2, and Scott Mahlke1

University of Michigan1, Google Inc.2

December 2013

Compilers creating custom processorsUniversity of MichiganElectrical Engineering and Computer Science

51

Backup Slides

52

What Does Calibration Miss?

10 20 30 40 50 60 70 80 90 1000%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

flower(250) grandma(800) deadline(900) foreman(300) harbour(300)

ice(240) akiyo(300) carphone(380) crew(300) galleon(350)

Calibration Interval

Perc

ent D

iffer

ence

in E

rror

53

SAGE Can Control The Output Quality

909192939495969798991001

2

3

4

5

6

7Naïve Bayes Fuzzy Kmeans Mean Filter

Output Quality(%)

Spee

dup

54

Distribution of Errors

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

HistogramKmeansNaïve BayesFuzzy KmeansSVMDynamicMean FilterGaussianBinarizationMeanshift

Error

Perc

enta

ge o

f Ele

men

ts

sage: self-tuning approximation for graphics engines

Documents