Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads
Gina Yuan, Shoumik Palkar, Deepak Narayanan, Matei Zaharia
Stanford University
USENIX ATC 2020 (July 15-17)


Page 1:

Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads

Gina Yuan, Shoumik Palkar, Deepak Narayanan, Matei Zaharia
Stanford University
USENIX ATC 2020 (July 15-17)

Page 2:

Background: Hardware Commoditization

[Figure: commodity accelerators, e.g. an NVIDIA GPU.]

Page 3:

Background: CPUs vs. GPUs

[Figure: a CPU with a few cores, control logic, cache, and large memory (4-way parallelism, 512GB memory) vs. a GPU with many small cores (1000-way parallelism!, 16GB memory), connected over PCI-E. Costly data transfers!]

Page 4:

Background: Data Science on the CPU

Popular Python data science libraries for the CPU.

Page 5:

Trend: Data Science on the GPU

NEW Python data science libraries for the GPU (cuDF, cuML, etc.).

Lots of parallel data!

Page 6:

Trend: CPU Libraries vs. GPU Libraries

https://github.com/rapidsai/cudf
https://cupy.chainer.org/
https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html
https://github.com/rapidsai/cuml

Page 7:

Trend: CPU Libraries vs. GPU Libraries

https://github.com/rapidsai/cudf
https://cupy.chainer.org/
https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html
https://github.com/rapidsai/cuml

Are GPU libraries as straightforward to use as they seem?

Page 8:

Motivating Example

cuML

Page 9:

Motivating Example

Missing Functions

Page 10:

Motivating Example

Missing Functions

Manual Data Transfers

X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
X_test = transfer(X_test, GPU)
result = transfer(result, CPU)

Page 11:

Motivating Example

Missing Functions

Manual Data Transfers

Small GPU Memory

X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
for (i,j) in split(X_test):
    X_test[i,j] = transfer(X_test[i,j], GPU)
    result[i,j] = transfer(result[i,j], CPU)

Page 12:

Motivating Example

Missing Functions

Manual Data Transfers

Small GPU Memory

Scheduling ???

X_train = transfer(X_train, GPU)
Y_train = transfer(Y_train, GPU)
for (i,j) in split(X_test):
    X_test[i,j] = transfer(X_test[i,j], GPU)
    result[i,j] = transfer(result[i,j], CPU)
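To make the manual-offload burden concrete, here is a rough sketch of the transfer-and-paging boilerplate the slides describe, using PyTorch tensors as a stand-in for the GPU library and a hypothetical model_gpu callable (the slides' example uses cuML; this only illustrates the pattern):

import numpy as np
import torch

def predict_in_chunks(model_gpu, X_test, chunk=1_000_000):
    # Manually page the test set through limited GPU memory: move a chunk to
    # the GPU, compute on it there, then move the result back to the CPU.
    outs = []
    for start in range(0, len(X_test), chunk):
        x = torch.from_numpy(X_test[start:start + chunk]).cuda()  # CPU -> GPU
        y = model_gpu(x)                                          # GPU compute
        outs.append(y.cpu().numpy())                              # GPU -> CPU
    return np.concatenate(outs)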

Page 13:

Solution: Offload Annotations

The annotator writes offload annotations (OAs) for CPU libraries. An end user imports the annotated library instead of the CPU library. Our runtime, Bach, automatically schedules computation and data transfers, and pages large datasets through GPU memory.

Page 14:

Goals

With less developer effort:
1. Match handwritten GPU performance

Page 15:

Goals

With less developer effort:
1. Match handwritten GPU performance
2. Scale to data sizes larger than GPU memory

Page 16:

Goals

With less developer effort:
1. Match handwritten GPU performance
2. Scale to data sizes larger than GPU memory
3. Beat CPU performance

Page 17:

Step 1: Annotator – Function Annotations

multiply = @oa(func=torch.mul)(np.multiply)
sqrt = @oa(func=torch.sqrt)(np.sqrt)

(torch.mul and torch.sqrt come from the GPU library; np.multiply and np.sqrt come from the CPU library.)

Page 18:

Step 1: Annotator – Function Annotations

multiply = @oa(func=torch.mul)(np.multiply)
sqrt = @oa(func=torch.sqrt)(np.sqrt)

(The annotation pairs each CPU library function with its corresponding GPU library function.)

Page 19:

Step 1: Annotator – Function Annotations

arg = (NdArrayType(),)
args = (NdArrayType(), NdArrayType())
ret = NdArrayType()

multiply = @oa(args, ret, func=torch.mul)(np.multiply)
sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)

(args/arg describe the inputs; ret describes the output.)

Page 20:

Step 1: Annotator – Allocation Annotations

arg = (NdArrayType(),)
args = (NdArrayType(), NdArrayType())
ret = NdArrayType()

multiply = @oa(args, ret, func=torch.mul)(np.multiply)
sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)
ones = @oa_alloc(ret, func=torch.ones)(np.ones)

Allocations only have a return type.
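As a rough, hypothetical illustration (not Bach's actual implementation), an @oa-style annotation can be thought of as a decorator factory that records the GPU counterpart and the offload split types alongside the CPU function; the slides' multiply = @oa(...)(np.multiply) then reads as calling the decorator directly, oa(...)(np.multiply):

def oa(args=None, ret=None, func=None):
    # Pair a CPU function with its GPU counterpart and record the offload
    # split types of its inputs and output (sketch only).
    def annotate(cpu_func):
        def wrapper(*a, **kw):
            # A real runtime would record this call in a lazy graph; here we
            # just forward to the CPU implementation to keep the sketch runnable.
            return cpu_func(*a, **kw)
        wrapper.gpu_func = func    # e.g. torch.mul
        wrapper.arg_types = args   # offload split types of the inputs
        wrapper.ret_type = ret     # offload split type of the output
        return wrapper
    return annotate

def oa_alloc(ret=None, func=None):
    # Allocations only have a return type, so there are no input split types.
    return oa(args=(), ret=ret, func=func)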

Page 21:

Step 1: Annotator – Allocation Annotations

arg = (NdArrayType(),)
args = (NdArrayType(), NdArrayType())
ret = NdArrayType()

multiply = @oa(args, ret, func=torch.mul)(np.multiply)
sqrt = @oa(arg, ret, func=torch.sqrt)(np.sqrt)
ones = @oa_alloc(ret, func=torch.ones)(np.ones)

NdArrayType() is an "offload split type". What's in an offload split type?

Page 22:

Step 1: Annotator – Offload Split Type

Offloading API:
  device(value): Which device the value is on.
  to(value, device): Transfers [value] to [device].

Page 23:

Step 1: Annotator – Offload Split Type

Offloading API:
  device(value): Which device the value is on.
  to(value, device): Transfers [value] to [device].

Implementation for NdArrayType():
  device(value): ... if isinstance(value, torch.Tensor): ...
  to(value, device): ... value.to(torch.device('cpu')).numpy()
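Filling in the rest of that NdArrayType implementation as a minimal sketch (NumPy on the CPU side, PyTorch on the GPU side; only the two fragments shown on the slide come from the actual code):

import numpy as np
import torch

class NdArrayType:
    def device(self, value):
        # Report which device the value currently lives on.
        if isinstance(value, torch.Tensor) and value.is_cuda:
            return 'gpu'
        return 'cpu'

    def to(self, value, device):
        # Transfer the value to the requested device.
        if device == 'gpu':
            return torch.from_numpy(np.ascontiguousarray(value)).cuda()
        return value.to(torch.device('cpu')).numpy()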

Page 24:

Step 1: Annotator – Offload Split Type

Splitting API (optional) [Mozart, SOSP '19]:
  size(value): Number of elements in the value.
  split(start, end, value): Splits a value to enable paging.
  merge(values): Merges split values.

Page 25:

Step 1: Annotator – Offload Split Type

Splitting API (optional) [Mozart, SOSP '19]:
  size(value): Number of elements in the value.
  split(start, end, value): Splits a value to enable paging.
  merge(values): Merges split values.

Implementation for NdArrayType():
  size(value): return value.shape[-1]
  split(start, end, value): return value[start:end]
  merge(values): return np.concatenate(values)
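In Bach these methods sit on the same split type as device()/to() above; written out as a runnable sketch for a 1-D array:

import numpy as np

class NdArrayType:
    # Splitting API (optional), shown here in isolation.
    def size(self, value):
        # Number of elements along the split dimension.
        return value.shape[-1]

    def split(self, start, end, value):
        # Return one contiguous page of the value.
        return value[start:end]

    def merge(self, values):
        # Reassemble the processed pages into a single array.
        return np.concatenate(values)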

Page 26:

Step 1: Annotator – Offload Split Type

NdArrayType(), DataFrameType(), ModelType()

Page 27:

Step 2: End User

(Simple, somewhat dumb, Python program.)

import numpy as np
# Allocate
a = np.ones(size, dtype='float64')
b = np.ones(size, dtype='float64')
# Compute
np.arcsin(a, out=a)
np.multiply(a, b, out=b)
np.sqrt(b, out=b)

End User ≠ Annotator

Page 28:

Step 2: End User

Import the annotated library instead of the CPU library.

import bach.numpy as np
# Allocate
a = np.ones(size, dtype='float64')
b = np.ones(size, dtype='float64')
# Compute
np.arcsin(a, out=a)
np.multiply(a, b, out=b)
np.sqrt(b, out=b)

Page 29:

Step 2: End User

Explicitly materialize lazy values with the included evaluate() function.

import bach.numpy as np
# Allocate
a = np.ones(size, dtype='float64')
b = np.ones(size, dtype='float64')
# Compute
np.arcsin(a, out=a)
np.multiply(a, b, out=b)
np.sqrt(b, out=b)
np.evaluate()

Page 30:

Step 3: Runtime – Scheduling

Bach generates a lazy computation graph and does a topological sort.

[Figure: lazy computation graph for a = np.ones(); np.arcsin(a); b = np.ones(); np.multiply(a,b); np.sqrt(b), topologically sorted into five numbered nodes. Legend: allocation / CPU / GPU; device assignments are still unknown (?).]
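A compact sketch of this step, with a hypothetical LazyOp node type and Kahn's algorithm for the topological sort (Bach's real data structures will differ):

from collections import deque

class LazyOp:
    def __init__(self, name, deps=()):
        self.name, self.deps = name, list(deps)   # deps are other LazyOp nodes

def toposort(nodes):
    # Kahn's algorithm: repeatedly emit a node whose dependencies have all run.
    indegree = {n: len(n.deps) for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in nodes:
            if n in m.deps:
                indegree[m] -= 1
                if indegree[m] == 0:
                    ready.append(m)
    return order

# The end-user program above becomes a five-node graph (in-place updates
# create the dependencies):
a = LazyOp('a = np.ones()')
arcsin = LazyOp('np.arcsin(a)', [a])
b = LazyOp('b = np.ones()')
mul = LazyOp('np.multiply(a, b)', [arcsin, b])
sqrt = LazyOp('np.sqrt(b)', [mul])
order = toposort([sqrt, mul, b, arcsin, a])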

Page 31:

Step 3: Runtime – Scheduling

Bach assigns functions to the CPU/GPU based on whether a GPU library implementation is provided in the annotation.

[Figure: same graph; for each function the scheduler asks "GPU library implementation provided?" and places it on the GPU if yes (e.g. np.sqrt(b)) or leaves it on the CPU if no. Legend: allocation / CPU / GPU.]

Page 32:

Step 3: Runtime – Scheduling

Bach assigns allocations to the CPU/GPU so they are on the same device as the first function that uses the data.

[Figure: same graph; each allocation (np.ones) is placed on the device of the first function that consumes it. Legend: allocation / CPU / GPU.]
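Taken together with the previous slide, the two placement rules could be sketched as follows (op fields such as is_alloc, gpu_func, and deps are hypothetical placeholders, not Bach's API):

def assign_devices(order):
    # order: lazy ops in topological order.
    device = {}
    # Functions run on the GPU iff their annotation provides a GPU implementation.
    for op in order:
        if not op.is_alloc:
            device[op] = 'gpu' if op.gpu_func is not None else 'cpu'
    # Allocations go on the same device as the first function that uses their data.
    for op in order:
        if op.is_alloc:
            users = [o for o in order if op in o.deps and not o.is_alloc]
            device[op] = device[users[0]] if users else 'cpu'
    return device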

Page 33:

Step 3: Runtime – Offloading API

Bach automatically transfers data using the offloading API.

[Figure: same graph with "-- transfer to GPU --" and "-- transfer to CPU --" steps inserted where values cross between devices. Legend: allocation / CPU / GPU.]
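One way the runtime could apply the offloading API, sketched with the same hypothetical op fields as above: before an op runs, any input that is not already on the op's device is moved with to():

def execute(order, device, split_type):
    # device: op -> 'cpu' or 'gpu'; split_type: op -> offload split type of its output.
    values = {}
    for op in order:
        args = []
        for dep in op.deps:
            v, ty = values[dep], split_type[dep]
            if ty.device(v) != device[op]:
                v = ty.to(v, device[op])   # automatic CPU <-> GPU transfer
            args.append(v)
        impl = op.gpu_func if device[op] == 'gpu' else op.func
        values[op] = impl(*args)
    return values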

Page 34:

Step 3: Runtime – Splitting API

Bach automatically pages large datasets using the splitting API. (Data size = 2^28.)

[Figure: same pipeline with Split and Merge steps and CPU/GPU transfers inserted so the data is processed in pages that fit in GPU memory. Legend: allocation / CPU / GPU.]
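And a sketch of the paging step itself: when a value is larger than GPU memory, the splitting API lets the runtime process it one page at a time and merge the partial results (the page size and helper arguments here are made up):

def paged_apply(gpu_func, value, split_type, to_gpu, to_cpu, page=1 << 26):
    # Process `value` in GPU-sized pages, then merge the partial results.
    n = split_type.size(value)
    pieces = []
    for start in range(0, n, page):
        piece = split_type.split(start, min(start + page, n), value)
        out = gpu_func(to_gpu(piece))   # compute one page on the GPU
        pieces.append(to_cpu(out))      # bring the partial result back
    return split_type.merge(pieces)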

Page 35:

Step 3: Runtime – Scheduling Heuristics (optional)

Naive cost-benefit analysis between data transfer and computation cost.

[Figure: at data size 2^28, estimated GPU compute + data transfer cost beats CPU compute cost, so the graph is scheduled on the GPU.]

Page 36:

Step 3: Runtime – Scheduling Heuristics (optional)

Naive cost-benefit analysis between data transfer and computation cost.

[Figure: at data size 2^10, CPU compute cost beats estimated GPU compute + data transfer cost, so the graph stays on the CPU.]

Page 37:

Step 3: Runtime – Scheduling Heuristics (optional)

Naive implementations of cost estimators.

[Plot: estimated cost vs. data size for "CPU Compute" and "GPU Compute + Data Transfer"; at 2^10 the CPU is cheaper, at 2^28 the GPU is cheaper.]
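The decision rule behind this plot can be written as a one-line comparison of estimated costs (the estimator functions below are stand-ins, not Bach's actual cost models):

def choose_device(n, cpu_cost, gpu_cost, transfer_cost):
    # Offload only if estimated GPU compute + transfer time beats CPU compute time.
    return 'gpu' if gpu_cost(n) + transfer_cost(n) < cpu_cost(n) else 'cpu'

# With made-up estimators (a fixed transfer overhead plus per-element costs),
# small inputs stay on the CPU and large inputs are offloaded:
# choose_device(2**10, lambda n: 1e-9*n, lambda n: 1e-11*n, lambda n: 1e-4 + 5e-10*n)  # -> 'cpu'
# choose_device(2**28, lambda n: 1e-9*n, lambda n: 1e-11*n, lambda n: 1e-4 + 5e-10*n)  # -> 'gpu'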

Page 38:

Evaluation

4 library integrations and 8 data science and ML workloads.

Page 39:

Integration Experience

~130 LOC per library, including the offloading / splitting APIs and function annotations.

Page 40:

Evaluation: Summary

Speedup: max 1200x, median 6.3x.

Page 41:

Evaluation: Summary

Page 42:

Evaluation: Summary

With less developer effort, Bach can:
1. Match handwritten GPU performance

Page 43:

Evaluation: Summary

With less developer effort, Bach can:
1. Match handwritten GPU performance
2. Scale to data sizes larger than GPU memory

Page 44:

Evaluation: Summary

With less developer effort, Bach can:
1. Match handwritten GPU performance
2. Scale to data sizes larger than GPU memory
3. Beat CPU performance

[Figure: results chart with 2.3x and 6.8x speedups highlighted.]

Page 45:

In-Depth Evaluation: Allocations

Crime Index saves time by eliminating the initial data transfer, while the allocation still fits in GPU memory.

[Figure: results chart with 4.6x and 1.1x speedups highlighted.]

Page 46:

In-Depth Evaluation: Heuristics

At smaller data sizes, TSVD schedules all computation on the CPU.

[Figure: results chart with an 11x speedup highlighted.]

Page 47:

In-Depth Evaluation: Splitting/Paging Datasets

[Motivating Example] The "fit" phase dominates the runtime until the "predict" phase can split/page data into the GPU.

[Figure: results chart with 6.2x and 2.0x speedups highlighted.]

Page 48:

Evaluation: Summary

Max speedup per workload: 1.7x, 6.9x, 0.81x, 5.7x, 1200x, 6.8x, 4.6x, 11x.

Page 49:

Evaluation: Summary

Max speedup per workload: 1.7x, 6.9x, 0.81x, 5.7x, 1200x, 6.8x, 4.6x, 11x.

Page 50:

Conclusion

OAs enable heterogeneous GPU computing in existing libraries and workloads with little to no code modifications.

With less developer effort, Bach + OAs can:
§ Match handwritten GPU performance
§ Scale to data sizes larger than GPU memory
§ Beat CPU performance

github.com/stanford-futuredata/[email protected]