SAGE: Self-Tuning Approximation for Graphics Engines

Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹
¹University of Michigan, ²Google Inc.
December 2013


Page 1: SAGE: Self-Tuning Approximation for Graphics Engines

SAGE: Self-Tuning Approximation for Graphics Engines

Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹

¹University of Michigan, ²Google Inc.

December 2013

Compilers Creating Custom Processors, University of Michigan, Electrical Engineering and Computer Science

Page 2: SAGE: Self-Tuning Approximation for Graphics Engines

Approximate Computing

• Different domains:
  – Machine Learning
  – Image Processing
  – Video Processing
  – Physical Simulation
  – …

• Less work
• Higher performance
• Lower power consumption

[Figure: output quality levels of 100%, 95%, 90%, and 85%]

Page 3: SAGE: Self-Tuning Approximation for Graphics Engines

Ubiquitous Graphics Processing Units

[Figure: supercomputers, servers, desktops, and cell phones]

• Wide range of devices
• Mostly regular applications
• Work on large data sets

Good opportunity for automatic approximation

Page 4: SAGE: Self-Tuning Approximation for Graphics Engines

SAGE Framework

• Simplify or skip processing
• Computationally expensive
• Lowest impact on the output quality

[Figure: Output Error (10%) versus Speedup (2~3x)]

• Self-Tuning Approximation on Graphics Engines
  – Write the program once
  – Automatic approximation
  – Self-tuning dynamically

Page 5: SAGE: Self-Tuning Approximation for Graphics Engines

Overview

Page 6: SAGE: Self-Tuning Approximation for Graphics Engines

SAGE Framework

[Figure labels: Input Program; Static Compiler; Approximation Methods; Approximate Kernels; Tuning Parameters; Runtime System]

Page 7: SAGE: Self-Tuning Approximation for Graphics Engines

Static Compilation

[Figure: the static compiler takes the Evaluation Metric, the Target Output Quality (TOQ), and the CUDA Code; applies Atomic Operation, Data Packing, and Thread Fusion; and produces Approximate Kernels and Tuning Parameters]

Page 8: SAGE: Self-Tuning Approximation for Graphics Engines

Runtime System

[Figure: CPU and GPU timeline with Preprocessing, Tuning, Execution, and Calibration phases; Quality (against the TOQ) and Speedup plotted over Time; time axis marked Tuning, T, T + C, T + 2C]

Page 9: SAGE: Self-Tuning Approximation for Graphics Engines

Runtime System

[Figure: CPU and GPU timeline with Preprocessing, Tuning, Execution, and Calibration phases; Quality (against the TOQ) and Speedup plotted over Time; time axis marked Tuning, T, T + C, T]

Page 10: SAGE: Self-Tuning Approximation for Graphics Engines

Approximation Methods

Page 11: SAGE: Self-Tuning Approximation for Graphics Engines

Approximation Methods

[Figure: static compilation diagram with inputs Evaluation Metric, Target Output Quality (TOQ), and CUDA Code; methods Atomic Operation, Data Packing, and Thread Fusion; outputs Approximate Kernels and Tuning Parameters]

Page 12: SAGE: Self-Tuning Approximation for Graphics Engines

Atomic Operations

// Compute histogram of colors in an image
__global__ void histogram(int n, int* colors, int* bucket)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;  // global thread index
    int nThreads = gridDim.x * blockDim.x;            // total number of threads
    // Grid-stride loop over the pixels assigned to this thread
    for (int i = tid; i < n; i += nThreads) {
        int c = colors[i];
        atomicAdd(&bucket[c], 1);
    }
}

• Atomic operations update a memory location such that the update appears to happen atomically

Page 13: SAGE: Self-Tuning Approximation for Graphics Engines

Atomic Operations

[Animation over slides 13 to 16; the only recoverable label is "Threads"]

Page 17: SAGE: Self-Tuning Approximation for Graphics Engines

Atomic Add

[Figure: Slowdown (y-axis, 0 to 14) versus Conflicts per Warp (x-axis, 0 to 31)]

Page 18: SAGE: Self-Tuning Approximation for Graphics Engines

Atomic Operation Tuning

[Figure: loop iterations executed by threads T0, T32, T64, T96, T128, and T160 of the histogram kernel from slide 12]

Page 19: SAGE: Self-Tuning Approximation for Graphics Engines

Atomic Operation Tuning

[Figure: iterations per thread for T0, T32, T64, T96, T128, and T160]

• SAGE skips one iteration per thread
• To improve performance, it drops the iteration with the maximum number of conflicts

Page 20: SAGE: Self-Tuning Approximation for Graphics Engines

Atomic Operation Tuning

[Figure: iterations per thread for T0, T32, T64, T96, T128, and T160]

• SAGE skips one iteration per thread
• To improve performance, it drops the iteration with the maximum number of conflicts
• In this example, it drops 50% of the iterations

Page 21: SAGE: Self-Tuning Approximation for Graphics Engines

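Atomic Operation Tuning

[Figure: iterations per thread for T0, T32, and T64]

With fewer threads, each thread runs more iterations, so skipping one iteration per thread lowers the drop rate to 25%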

Page 22: SAGE: Self-Tuning Approximation for Graphics Engines

Dropping One Iteration

Iteration No.   Conflicts
0               2
1               8
2               17
3               12

[Figure labels: Conflict Detection stages CD 0 to CD 3; Max conflict; Iteration 0, Iteration 1, Iteration 2; After optimization: 0, 1, 3]
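The idea of skipping the most conflicted iteration can be sketched as follows. This is a minimal illustration, not SAGE's generated code: the kernel name `histogramDropMax`, the host helper `runHistogramDropMax`, and the two-pass structure are assumptions, and it does not reproduce the overlap of conflict detection with execution suggested by the figure. It assumes compute capability 7.0 or newer (for `__match_any_sync`), a block size that is a multiple of 32, and colors that are valid non-negative bucket indices. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Approximate histogram (illustrative): every warp skips the one iteration
// in which its lanes collide most often on the same bucket.
__global__ void histogramDropMax(int n, const int* colors, int* bucket)
{
    const unsigned FULL = 0xffffffffu;
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    int iters = (n + nThreads - 1) / nThreads;  // same trip count for all lanes

    // Pass 1: conflict detection. Find the iteration with the most conflicts.
    int worstIter = -1, worstConflicts = -1;
    for (int k = 0; k < iters; ++k) {
        int i = tid + k * nThreads;
        // Out-of-range lanes get a unique negative key so they match nobody.
        int key = (i < n) ? colors[i] : -1 - tid;
        int conflicts = __popc(__match_any_sync(FULL, key)) - 1;
        // Butterfly reduction: every lane ends up with the warp-wide maximum.
        for (int off = 16; off > 0; off >>= 1)
            conflicts = max(conflicts, __shfl_xor_sync(FULL, conflicts, off));
        if (conflicts > worstConflicts) {
            worstConflicts = conflicts;
            worstIter = k;
        }
    }

    // Pass 2: run every iteration except the most conflicted one.
    for (int k = 0; k < iters; ++k) {
        int i = tid + k * nThreads;
        if (k != worstIter && i < n)
            atomicAdd(&bucket[colors[i]], 1);
    }
}

// Hypothetical host helper: runs the kernel and returns the bucket counts.
std::vector<int> runHistogramDropMax(const std::vector<int>& colors,
                                     int nBuckets, int blocks, int threads)
{
    assert(threads % 32 == 0);  // the kernel uses full-warp primitives
    int n = static_cast<int>(colors.size());
    int *dColors = nullptr, *dBucket = nullptr;
    cudaMalloc(&dColors, n * sizeof(int));
    cudaMalloc(&dBucket, nBuckets * sizeof(int));
    cudaMemcpy(dColors, colors.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dBucket, 0, nBuckets * sizeof(int));
    histogramDropMax<<<blocks, threads>>>(n, dColors, dBucket);
    std::vector<int> bucket(nBuckets);
    cudaMemcpy(bucket.data(), dBucket, nBuckets * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dColors);
    cudaFree(dBucket);
    return bucket;
}
```

Launching fewer threads gives each thread more iterations, so the single skipped iteration becomes a smaller fraction of the work. In this sketch the thread count therefore acts as the tuning knob for the drop rate.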

Page 23: SAGE: Self-Tuning Approximation for Graphics Engines

Approximation Methods

[Figure: static compilation diagram with inputs Evaluation Metric, Target Output Quality (TOQ), and CUDA Code; methods Atomic Operation, Data Packing, and Thread Fusion; outputs Approximate Kernels and Tuning Parameters]

Page 24: SAGE: Self-Tuning Approximation for Graphics Engines

Data Packing

[Animation over slides 24 to 27; recoverable labels: Threads, Memory]

Page 28: SAGE: Self-Tuning Approximation for Graphics Engines

Quantization

• Preprocessing finds the min and max of the input sets and packs the data
• During execution, each thread unpacks the data and transforms the quantization level to data by applying a linear transformation

[Figure: the range from Min to Max divided into Quantization Levels 0 to 7, with the example 3-bit code 101]

Page 29: SAGE: Self-Tuning Approximation for Graphics Engines

Quantization

• Preprocessing finds the max and min of the input sets and packs the data
• During execution, each thread unpacks the data and transforms the quantization level to data by applying a linear transformation

[Figure: the range from Min to Max divided into levels 0 to 7, with the example 3-bit code 101]
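A minimal sketch of the pack and unpack steps described above, under stated assumptions: the slide shows 8 levels (3 bits), whereas this sketch uses 8 bits per value (four values per 32-bit word) to keep the bit arithmetic simple. The names `Packed`, `packData`, `unpackKernel`, and `unpackOnGpu` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int BITS = 8;                           // assumed bits per packed value
constexpr int PER_WORD = 32 / BITS;               // values per 32-bit word
constexpr unsigned MAX_LEVEL = (1u << BITS) - 1;  // highest quantization level

struct Packed {
    std::vector<uint32_t> words;  // packed quantization levels
    float lo;                     // minimum of the input
    float step;                   // width of one quantization level
};

// Preprocessing (host): find min and max, quantize each value, pack the levels.
Packed packData(const std::vector<float>& in)
{
    assert(!in.empty());
    auto mm = std::minmax_element(in.begin(), in.end());
    float lo = *mm.first, hi = *mm.second;
    float step = (hi > lo) ? (hi - lo) / MAX_LEVEL : 1.0f;
    Packed p{std::vector<uint32_t>((in.size() + PER_WORD - 1) / PER_WORD, 0u), lo, step};
    for (size_t i = 0; i < in.size(); ++i) {
        // Round to the nearest level and clamp to the valid range.
        uint32_t level = static_cast<uint32_t>((in[i] - lo) / step + 0.5f);
        level = std::min<uint32_t>(level, MAX_LEVEL);
        p.words[i / PER_WORD] |= level << (BITS * (i % PER_WORD));
    }
    return p;
}

// Execution (device): each thread unpacks one level and maps it back to data
// with the linear transformation value = lo + level * step.
__global__ void unpackKernel(int n, const uint32_t* words, float lo, float step, float* out)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {
        uint32_t level = (words[i / PER_WORD] >> (BITS * (i % PER_WORD))) & MAX_LEVEL;
        out[i] = lo + level * step;
    }
}

// Hypothetical host helper: unpacks n values on the GPU and returns them.
std::vector<float> unpackOnGpu(const Packed& p, int n)
{
    uint32_t* dWords = nullptr;
    float* dOut = nullptr;
    size_t bytes = p.words.size() * sizeof(uint32_t);
    cudaMalloc(&dWords, bytes);
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dWords, p.words.data(), bytes, cudaMemcpyHostToDevice);
    unpackKernel<<<(n + 255) / 256, 256>>>(n, dWords, p.lo, p.step, dOut);
    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dWords);
    cudaFree(dOut);
    return out;
}
```

In a real kernel the unpacked value would feed directly into the computation instead of being written back out. The round-trip error of each value is at most half a quantization step, and the number of memory words read drops by a factor of `PER_WORD`.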

Page 30: SAGE: Self-Tuning Approximation for Graphics Engines

Approximation Methods

[Figure: static compilation diagram with inputs Evaluation Metric, Target Output Quality (TOQ), and CUDA Code; methods Atomic Operation, Data Packing, and Thread Fusion; outputs Approximate Kernels and Tuning Parameters]

Page 31: SAGE: Self-Tuning Approximation for Graphics Engines

Thread Fusion

[Figure: Threads reading from and writing to Memory; neighboring memory elements are marked as approximately equal (≈)]

Page 32: SAGE: Self-Tuning Approximation for Graphics Engines

Thread Fusion

[Figure: Threads and two rows of Memory]

Page 33: SAGE: Self-Tuning Approximation for Graphics Engines

Thread Fusion

[Figure: threads T0 and T1, each with a Computation phase and an Output writing phase]

Page 34: SAGE: Self-Tuning Approximation for Graphics Engines

Thread Fusion

[Figure: Block 0 and Block 1, each with threads T0 to T255; legend: Computation, Output writing]

Page 35: SAGE: Self-Tuning Approximation for Graphics Engines

Thread Fusion

[Figure: Block 0 and Block 1, each with threads T0 to T127]

Reducing the number of threads per block results in poor resource utilization

Page 36: SAGE: Self-Tuning Approximation for Graphics Engines

Block Fusion

[Figure: Block 0 and Block 1 fused into one block with threads T0 to T255]
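The fusion idea can be sketched as below, assuming that neighboring outputs are close enough for one thread to compute a value once and write it to several adjacent output elements. `expensive`, `fusedKernel`, `runFused`, and the `fuse` parameter are hypothetical names for illustration. The helper keeps the block size fixed and launches fewer blocks, which mirrors block fusion rather than shrinking each block. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an expensive per-element computation.
__device__ float expensive(float x) { return sqrtf(x) * 0.5f + 1.0f; }

// Fused kernel: one thread computes once and writes `fuse` adjacent outputs.
__global__ void fusedKernel(int n, const float* in, float* out, int fuse)
{
    int t = threadIdx.x + blockDim.x * blockIdx.x;
    int base = t * fuse;                        // first element owned by this thread
    if (base < n) {
        float v = expensive(in[base]);          // computation: once
        for (int k = 0; k < fuse && base + k < n; ++k)
            out[base + k] = v;                  // output writing: up to `fuse` times
    }
}

// Hypothetical host helper. The block size stays the same and the number of
// blocks shrinks by roughly a factor of `fuse`.
std::vector<float> runFused(const std::vector<float>& in, int fuse, int threadsPerBlock = 256)
{
    assert(fuse >= 1 && !in.empty());
    int n = static_cast<int>(in.size());
    int threadsNeeded = (n + fuse - 1) / fuse;
    int blocks = (threadsNeeded + threadsPerBlock - 1) / threadsPerBlock;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    fusedKernel<<<blocks, threadsPerBlock>>>(n, dIn, dOut, fuse);
    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

With `fuse = 1` the kernel is exact; larger values trade accuracy for fewer computations. The strided writes in this sketch are not coalesced, so a tuned version would likely arrange the output accesses differently.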

Page 37: SAGE: Self-Tuning Approximation for Graphics Engines

Runtime

Page 38: SAGE: Self-Tuning Approximation for Graphics Engines

How to Compute Output Quality?

[Figure: the Evaluation Metric compares the outputs of the Accurate Version and the Approximate Version]

• High overhead
• Tuning should find a good enough approximate version as fast as possible
• Greedy algorithm

Page 39: SAGE: Self-Tuning Approximation for Graphics Engines

Tuning

TOQ = 90%

K(0,0): Quality = 100%, Speedup = 1x

K(x,y): x is the tuning parameter of the first optimization; y is the tuning parameter of the second optimization

Page 40: SAGE: Self-Tuning Approximation for Graphics Engines

Tuning

TOQ = 90%

Kernel   Quality   Speedup
K(0,0)   100%      1x
K(1,0)   94%       1.15x
K(0,1)   96%       1.5x

Page 41: SAGE: Self-Tuning Approximation for Graphics Engines

Tuning

TOQ = 90%

Kernel   Quality   Speedup
K(0,0)   100%      1x
K(1,0)   94%       1.15x
K(0,1)   96%       1.5x
K(1,1)   94%       2.5x
K(0,2)   95%       1.6x

Page 42: SAGE: Self-Tuning Approximation for Graphics Engines

Tuning

TOQ = 90%

Kernel   Quality   Speedup
K(0,0)   100%      1x
K(1,0)   94%       1.15x
K(0,1)   96%       1.5x
K(1,1)   94%       2.5x
K(0,2)   95%       1.6x
K(2,1)   88%       3x
K(1,2)   87%       2.8x

Tuning path: K(0,0) → K(0,1) → K(1,1)
Final choice: K(1,1), because both K(2,1) and K(1,2) fall below the TOQ
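The greedy search over K(x,y) can be sketched as host-side code (plain C++ that also compiles in a `.cu` file). The rule used here, following the child with the higher speedup among those that still meet the TOQ, is an assumption that is consistent with the numbers on the tuning slides; `Measurement`, `Choice`, `greedyTune`, and the `evaluate` callback are hypothetical names.

```cuda
#include <cassert>
#include <functional>

struct Measurement { double quality; double speedup; };  // quality in percent
struct Choice { int x; int y; Measurement m; };

// Greedy tuning sketch: starting from the accurate kernel K(0,0), step to
// whichever neighbor, K(x+1,y) or K(x,y+1), meets the TOQ with the higher
// speedup. Stop when neither neighbor meets the TOQ.
Choice greedyTune(const std::function<Measurement(int, int)>& evaluate,
                  double toq, int maxX, int maxY)
{
    assert(maxX >= 0 && maxY >= 0);
    Choice best{0, 0, {100.0, 1.0}};  // K(0,0): assumed 100% quality, 1x speedup
    for (;;) {
        const int cand[2][2] = {{best.x + 1, best.y}, {best.x, best.y + 1}};
        bool moved = false;
        Choice next = best;
        for (const auto& c : cand) {
            if (c[0] > maxX || c[1] > maxY) continue;  // outside the parameter range
            Measurement m = evaluate(c[0], c[1]);      // run and measure this kernel
            if (m.quality >= toq && (!moved || m.speedup > next.m.speedup)) {
                next = Choice{c[0], c[1], m};
                moved = true;
            }
        }
        if (!moved) return best;  // both neighbors violate the TOQ
        best = next;
    }
}
```

Each step evaluates at most two kernels, so the number of quality measurements grows with the length of the tuning path, not with the size of the whole parameter grid. Fed the measurements from the slides with TOQ = 90%, this sketch stops at K(1,1).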

Page 43: SAGE: Self-Tuning Approximation for Graphics Engines

Evaluation

Page 44: SAGE: Self-Tuning Approximation for Graphics Engines

Experimental Setup

• Backend of Cetus compiler

• GPU
  – NVIDIA GTX 560
  – 2GB GDDR5

• CPU
  – Intel Core i7

• Benchmarks
  – Image processing
  – Machine Learning

Page 45: SAGE: Self-Tuning Approximation for Graphics Engines

K-Means

[Figure: Output Quality (axis 80 to 100) and Accumulative Speedup (axis 0 to 2) over x-axis values 1 to 116, with the TOQ marked]

Page 46: SAGE: Self-Tuning Approximation for Graphics Engines

Confidence

[Figure: y-axis 40 to 100 versus Calibration points (x-axis, 0 to 400), with curves for CI = 99%, CI = 95%, and CI = 90%; the value 93 is marked]

After checking 50 samples, we will be 93% confident that 95% of the outputs satisfy the TOQ threshold.

posterior ∝ prior × likelihood
(posterior: Beta; prior: Uniform; likelihood: Binomial)
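The 93% figure can be reproduced with a short worked calculation, assuming that all 50 checked samples satisfy the TOQ. Here θ (the fraction of outputs that satisfy the TOQ), n (the number of samples checked), and s (the number that pass) are symbols introduced for illustration.

```latex
\begin{align*}
\theta &\sim \mathrm{Beta}(1, 1) && \text{(uniform prior)} \\
s \mid \theta &\sim \mathrm{Binomial}(n, \theta) && \text{(likelihood)} \\
\theta \mid s &\sim \mathrm{Beta}(1 + s,\; 1 + n - s) && \text{(posterior)} \\
P(\theta > 0.95 \mid s = n = 50) &= 1 - \int_0^{0.95} 51\,\theta^{50}\,d\theta
  = 1 - 0.95^{51} \approx 0.927
\end{align*}
```

Under these assumptions the posterior is Beta(51, 1), whose cumulative distribution function is θ^51, which gives roughly 93% confidence that at least 95% of the outputs meet the TOQ.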

Page 47: SAGE: Self-Tuning Approximation for Graphics Engines

Calibration Overhead

[Figure: Calibration Overhead (%) (y-axis, 0 to 25) versus Calibration Interval (x-axis, 0 to 200) for Gaussian and K-Means]

Page 48: SAGE: Self-Tuning Approximation for Graphics Engines

Performance

[Figure: Speedup (axis 1 to 4) of SAGE and Loop Perforation in two panels, TOQ = 95% and TOQ = 90%, for K-Means, NB, Histogram, SVM, Fuzzy, MeanShift, Binarization, Dynamic, Mean Filter, Gaussian, and the Geometric Mean]

Bar labels, first panel (transcript order): 2.0, 1.8, 3.6, 1.4, 1.7, 2.3, 2.2, 1.3, 3.3, 1.7, 1.7, 1.3, 1.3, 1.8, 1.5, 1.6
Bar labels, second panel (transcript order): 2.5, 2.2, 6.4, 1.4, 1.8, 2.7, 2.2, 1.3, 3.3, 3.1, 3.1, 1.4, 1.6, 3.5, 1.3, 1.3, 1.5, 1.4, 1.3

Page 49: SAGE: Self-Tuning Approximation for Graphics Engines

Conclusion

• Automatic approximation is possible

• SAGE automatically generates approximate kernels with different parameters

• Runtime system uses tuning parameters to control the output quality during execution

• 2.5x speedup with less than 10% quality loss compared to the accurate execution


Page 51: SAGE: Self-Tuning Approximation for Graphics Engines

Backup Slides

Page 52: SAGE: Self-Tuning Approximation for Graphics Engines

What Does Calibration Miss?

[Figure: Percent Difference in Error (y-axis, 0% to 100%) versus Calibration Interval (x-axis, 10 to 100) for flower(250), grandma(800), deadline(900), foreman(300), harbour(300), ice(240), akiyo(300), carphone(380), crew(300), and galleon(350)]

Page 53: SAGE: Self-Tuning Approximation for Graphics Engines

SAGE Can Control The Output Quality

[Figure: Speedup (y-axis, 1 to 7) versus Output Quality (%) (x-axis, 90 to 100) for Naïve Bayes, Fuzzy Kmeans, and Mean Filter]

Page 54: SAGE: Self-Tuning Approximation for Graphics Engines

Distribution of Errors

[Figure: Percentage of Elements (y-axis, 0% to 100%) versus Error (x-axis, 0% to 100%) for Histogram, Kmeans, Naïve Bayes, Fuzzy Kmeans, SVM, Dynamic, Mean Filter, Gaussian, Binarization, and Meanshift]