Swayam: Distributed Autoscaling for Machine Learning as a Service

Arpan Gujarati, Björn B. Brandenburg, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley


Page 1:

Swayam: Distributed Autoscaling for Machine Learning as a Service

Arpan Gujarati, Björn B. Brandenburg, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley


Page 3:

Machine Learning as a Service (MLaaS)

Provider logos: Data Science & Machine Learning, Amazon Machine Learning, Machine Learning, Google Cloud AI

1. Training: Untrained model + Dataset = Trained Model
2. Prediction: Trained Model + Query = Answer

Page 4:

Machine Learning as a Service (MLaaS)

2. Prediction: Trained Model + Query = Answer

Models are already trained and available for prediction (this work).

Page 5:

Swayam: distributed autoscaling, inside the MLaaS infrastructure, of the compute resources needed for prediction serving.

Page 6:

Prediction serving (application perspective)

[Diagram: an application / end user sends an image to the MLaaS provider's image classifier, which returns "cat".]

Page 7:

Prediction serving (provider perspective)

MLaaS provider: lots of trained models! Finite compute resources, the "backends" for prediction.

Page 8:

Prediction serving (provider perspective)

MLaaS provider: lots of trained models, finite compute resources ("backends" for prediction), and multiple request dispatchers ("frontends").

(1) New prediction request for the pink model
(2) A frontend receives the request

Page 9:

Prediction serving (provider perspective)

(1) New prediction request for the pink model
(2) A frontend receives the request
(3) The request is dispatched to an idle backend
(4) The backend fetches the pink model

Page 10:

Prediction serving (provider perspective)

(1) New prediction request for the pink model
(2) A frontend receives the request
(3) The request is dispatched to an idle backend
(4) The backend fetches the pink model
(5) The request outcome is predicted
(6) The response is sent back through the frontend

Page 11:

Prediction serving (objectives)

[Diagram: applications / end users send requests through the frontends (request dispatchers) to the finite backends at the MLaaS provider.]

Page 12:

Prediction serving (objectives)

Provider: resource efficiency. Applications / end users: low latency, SLAs.

Page 13:

Static partitioning of trained models

Page 14:

Static partitioning of trained models: the trained models are partitioned among the finite backends.

Page 15:

Static partitioning of trained models: the trained models are partitioned among the finite backends. No need to fetch and install the pink model.

Page 16:

Static partitioning of trained models

Problem: not all models are used at all times.

Page 17:

Static partitioning of trained models

Problem: not all models are used at all times.
Problem: many more models than backends, with a high memory footprint per model.

Page 18:

Static partitioning of trained models

Problem: not all models are used at all times.
Problem: many more models than backends, with a high memory footprint per model.

Given the objectives (resource efficiency; low latency, SLAs), static partitioning is infeasible.

Page 19:

Classical approach: autoscaling. The number of active backends is automatically scaled up or down based on load.

[Plot: # active backends for the pink model over time, alongside the request load for the pink model.]

Page 20:

Classical approach: autoscaling. The number of active backends is automatically scaled up or down based on load.

[Plot: # active backends for the pink model over time, alongside the request load for the pink model.]

With ideal autoscaling, there are enough backends to guarantee low latency, while the number of active backends over time is minimized for resource efficiency.

Page 21:

Autoscaling for MLaaS is challenging [1/3]

Page 22:

Autoscaling for MLaaS is challenging [1/3]

[Diagram: (4) the backend fetches the pink model; (5) the request outcome is predicted.]

Page 23:

Autoscaling for MLaaS is challenging [1/3]

Challenge: provisioning time (4) (~ a few seconds) >> execution time (5) (~ 10 ms to 500 ms)
Requirement: predictive autoscaling to hide the provisioning latency

Page 24:

Autoscaling for MLaaS is challenging [2/3]

The MLaaS architecture is large-scale and multi-tiered: frontends, backends [VMs, containers], and a hardware broker.

Page 25:

Autoscaling for MLaaS is challenging [2/3]

The MLaaS architecture is large-scale and multi-tiered: frontends, backends [VMs, containers], and a hardware broker.

Challenge: multiple frontends, each with only partial information about the workload
Requirement: fast, coordination-free, globally-consistent autoscaling decisions on the frontends

Page 26:

Autoscaling for MLaaS is challenging [3/3]

Strict, model-specific SLAs on response times:
"99% of requests must complete under 500 ms"
"99.9% of requests must complete under 1 s"
"[A] 95% of requests must complete under 850 ms"
"[B] Tolerate up to a 25% increase in request rates without violating [A]"

Page 27:

Autoscaling for MLaaS is challenging [3/3]

Strict, model-specific SLAs on response times (e.g., "99% of requests must complete under 500 ms").

Challenge: no closed-form solutions to get response-time distributions for SLA-aware autoscaling
Requirement: accurate waiting-time and execution-time distributions

Page 28:

Swayam: model-driven distributed autoscaling

Challenges:
1. Provisioning time (~ a few seconds) >> execution time (~ 10 ms to 500 ms)
2. Multiple frontends with partial information about the workload
3. No closed-form solutions to get response-time distributions for SLA-aware autoscaling

We address these challenges by leveraging specific ML workload characteristics, and design an analytical model for resource estimation that allows distributed and predictive autoscaling.

Page 29:

Outline

1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results

Page 30:

System architecture

Page 31:

System architecture

[Diagram: applications / end users send requests to the frontends; a hardware broker manages a global pool of backends; separate sets of backends are dedicated to the pink, blue, and green models.]

Page 32:

System architecture

Objective: each dedicated set of backends should dynamically scale.
1. If load decreases, extra backends go back to the global pool (for resource efficiency).
2. If load increases, new backends are set up in advance (for SLA compliance).

Page 33:

System architecture: let's focus on the pink model.

Objective: the dedicated set of backends should dynamically scale.
1. If load decreases, extra backends go back to the global pool (for resource efficiency).
2. If load increases, new backends are set up in advance (for SLA compliance).

Page 34:

Key idea 1: Assign states to each backend

Page 35:

Key idea 1: Assign states to each backend

cold: in the global pool
warm: dedicated to a trained model

Page 36:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request
warm not-in-use: hasn't executed a request for a while

Page 37:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request
  busy: executing a request
  idle: waiting for a request
warm not-in-use: hasn't executed a request for a while

Page 38:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request (busy: executing a request; idle: waiting for a request)
warm not-in-use: hasn't executed a request for a while; dedicated, but not used due to reduced load. It can be safely garbage collected (scale-in) ... or easily transitioned to an in-use state (scale-out).

Page 39:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request (busy/idle)
warm not-in-use: can be safely garbage collected (scale-in) or easily transitioned to an in-use state (scale-out)

How do frontends know which dedicated backends to use, and which not to use?
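The state assignments above can be sketched as a small state machine. The states come from the slides; the event names ("dedicate", "dispatch", and so on) are illustrative, since the deck only names the states and the scale-in/scale-out transitions:

```python
from enum import Enum

class BackendState(Enum):
    COLD = "cold"                    # in the global pool
    IDLE = "warm in-use, idle"       # waiting for a request
    BUSY = "warm in-use, busy"       # executing a request
    NOT_IN_USE = "warm not-in-use"   # dedicated, but unused due to reduced load

# Transitions implied by the slides (event names are hypothetical):
TRANSITIONS = {
    (BackendState.COLD, "dedicate"): BackendState.IDLE,         # leave the global pool
    (BackendState.IDLE, "dispatch"): BackendState.BUSY,
    (BackendState.BUSY, "complete"): BackendState.IDLE,
    (BackendState.IDLE, "deactivate"): BackendState.NOT_IN_USE, # load dropped
    (BackendState.NOT_IN_USE, "activate"): BackendState.IDLE,   # scale-out
    (BackendState.NOT_IN_USE, "collect"): BackendState.COLD,    # scale-in (GC)
}

def step(state, event):
    """Advance a backend's state; raises KeyError on an illegal transition."""
    return TRANSITIONS[(state, event)]
```

Note that a not-in-use backend can either be reclaimed by the pool or reactivated cheaply, which is exactly what makes scale-in safe and scale-out fast.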

Page 40:

Key idea 2: Order the dedicated set of backends

[Diagram: backends 1-12 dedicated to the pink model.]

Page 41:

Key idea 2: Order the dedicated set of backends

If 9 backends are sufficient for SLA compliance ...

Page 42:

Key idea 2: Order the dedicated set of backends

If 9 backends are sufficient for SLA compliance, frontends use backends 1-9 (warm in-use, busy/idle) and backends 10-12 transition to the warm not-in-use state.

Page 43:

Key idea 2: Order the dedicated set of backends

If 9 backends are sufficient for SLA compliance, frontends use backends 1-9 (warm in-use, busy/idle) and backends 10-12 transition to the warm not-in-use state.

How do frontends know how many backends are sufficient?
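Because the dedicated backends are globally ordered, any frontend that knows the required count can derive the in-use set on its own. A minimal sketch:

```python
def split_by_order(dedicated, n_required):
    """Split a globally agreed ordering of dedicated backends: the first
    n_required are in use, the rest are not in use. Every frontend that
    computes the same n_required derives the same split, with no
    coordination among frontends."""
    in_use = dedicated[:n_required]       # frontends dispatch only to these
    not_in_use = dedicated[n_required:]   # scale-in candidates
    return in_use, not_in_use
```

For the example above, `split_by_order(list(range(1, 13)), 9)` puts backends 1-9 in use and leaves 10-12 as not-in-use.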

Page 44:

Key idea 3: A Swayam instance on every frontend

Each frontend runs a Swayam instance that observes the incoming requests and computes the globally consistent minimum # backends necessary for SLA compliance.

Page 45:

Outline

1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results

Page 46:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Page 47:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Inputs: 1. expected request execution time; 2. expected request waiting time; 3. total request load.

Page 48:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Inputs: 1. expected request execution time; 2. expected request waiting time; 3. total request load. All three estimates leverage ML workload characteristics.

Page 49:

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform.

[Histogram: normalized frequency (%) vs. service times (ms) for Trace 1; data from trace, bin width = 10.]

Page 50:

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform. Variation is low:
‣ fixed-size feature vectors
‣ input-independent control flow
‣ non-deterministic machine & OS events are the main sources of variability

[Histogram: normalized frequency (%) vs. service times (ms) for Trace 1; data from trace, bin width = 10.]

Page 51:

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform. Variation is low:
‣ fixed-size feature vectors
‣ input-independent control flow
‣ non-deterministic machine & OS events are the main sources of variability

Execution times are modeled using log-normal distributions.

[Histogram: normalized frequency (%) vs. service times (ms) for Trace 1; data from trace, bin width = 10, with fitted log-normal distribution.]
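Fitting a log-normal to observed service times is straightforward: take logs of the samples, then the maximum-likelihood parameters are the mean and standard deviation of the logged data. A stdlib-only sketch on synthetic data (the trace data itself is not public, so the samples here are drawn from a known log-normal):

```python
import math
import random
import statistics

def fit_lognormal(service_times_ms):
    """MLE fit of a log-normal distribution: mu and sigma are the mean
    and standard deviation of the logged samples."""
    logs = [math.log(t) for t in service_times_ms]
    return statistics.fmean(logs), statistics.pstdev(logs)

# Synthetic service times with median 100 ms and shape parameter 0.25;
# the fit should recover both parameters closely.
rng = random.Random(7)
data = [rng.lognormvariate(math.log(100), 0.25) for _ in range(20000)]
mu, sigma = fit_lognormal(data)
```

With the fitted (mu, sigma) in hand, any execution-time percentile follows directly from the log-normal CDF, which is what makes this model convenient for the response-time analysis later in the deck.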

Page 52:

Determining expected request waiting times

Waiting times depend on the load-balancing (LB) policy.

Page 53:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold.]

Page 54:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global and partitioned scheduling.]

Global and partitioned scheduling perform well, but there are implementation tradeoffs.

Page 55:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global scheduling, partitioned scheduling, and Join-Idle-Queue.]

Global and partitioned scheduling perform well, but there are implementation tradeoffs. JIQ doesn't result in good tail waiting times.

Page 56:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global scheduling, partitioned scheduling, Join-Idle-Queue, and random dispatch.]

Global and partitioned scheduling perform well, but there are implementation tradeoffs. JIQ doesn't result in good tail waiting times. Random dispatch gives much better tail waiting times.

Page 57:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global scheduling, partitioned scheduling, Join-Idle-Queue, and random dispatch.]

We use an LB policy based on random dispatch!
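Random dispatch needs no shared queue state: each frontend independently picks one of the in-use backends uniformly at random. A minimal sketch (the `Backend` class is an illustrative stand-in, not part of the system described here):

```python
import random

class Backend:
    """Illustrative stand-in for a warm in-use backend."""
    def __init__(self, backend_id):
        self.backend_id = backend_id
        self.queue = []

    def enqueue(self, request):
        self.queue.append(request)

def random_dispatch(request, in_use, rng=random):
    # No coordination between frontends and no per-backend load queries:
    # pick one of the in-use backends uniformly at random and enqueue there.
    backend = rng.choice(in_use)
    backend.enqueue(request)
    return backend
```

Uniform random dispatch makes each backend behave as an independent queue regardless of how many frontends exist, which is what allows every frontend to model the waiting-time distribution analytically on its own.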

Page 58:

Determining the total request load in the near future, to account for high provisioning times

Page 59:

Determining the total request load

Let L be the total request rate and F the total # frontends. Since the hardware broker spreads requests uniformly among the frontends, each frontend sees L' = L/F.

Page 60:

Determining the total request load

Each Swayam instance predicts L' for the near future; how far ahead depends on the time to set up a new backend.

Page 61:

Determining the total request load

Each Swayam instance:
‣ predicts L' for the near future
‣ given F (determined from the broker or through a gossip protocol), computes L = F × L'
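The L = F × L' step is simple arithmetic; the interesting part is predicting L'. As a sketch, the predictor below uses linear extrapolation of the locally observed rate over the backend setup time. The extrapolation method is a hypothetical choice, since the slides only say that L' is predicted for the near future:

```python
def predict_total_load(rate_samples, setup_time_s, num_frontends):
    """Estimate the total near-future load L = F * L' at one frontend.

    rate_samples: [(time_s, local_rate_rps), ...] observed locally (L').
    setup_time_s: time to provision a new backend; we predict this far ahead.
    num_frontends: F, learned from the broker or via a gossip protocol.
    """
    (t0, r0), (t1, r1) = rate_samples[-2], rate_samples[-1]
    slope = (r1 - r0) / (t1 - t0)
    # Project the local rate to the end of the provisioning window.
    local_pred = max(0.0, r1 + slope * setup_time_s)
    # The broker spreads load uniformly, so every frontend scaling its own
    # prediction by F arrives at (approximately) the same total L.
    return num_frontends * local_pred
```

Because all frontends see statistically identical slices of the load, each one can reconstruct the global picture from local observations alone, which is what keeps the autoscaling decision coordination-free.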

Page 62:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Inputs: 1. expected request execution time; 2. expected request waiting time; 3. total request load.

Page 63:

SLA-aware resource estimation

For each trained model, the SLA specifies:
‣ RTmax: response-time threshold
‣ SLmin: service level
‣ U: burst threshold

Output: n = min # backends.

Page 64:

SLA-aware resource estimation

[Flowchart: the load (amplified by U) plus the waiting-time and execution-time distributions feed the response-time model. Starting from n = 1, check whether the SLmin-percentile response time is < RTmax; if not, increment n and retry; if yes, output n = min # backends.]

Page 65:

SLA-aware resource estimation

The waiting-time and execution-time distributions are combined by convolution, yielding a closed-form expression for the percentile response time (see the appendix).

Page 66:

SLA-aware resource estimation

The load fed into the response-time model is amplified based on the burst threshold U.

Page 67: Swayam - microsoft.com · Machine Learning as a Service (MLaaS) + = Trained Model 2. Prediction Query Answer Models are already trained and available for prediction This work 2. Swayam

SLA-aware resource estimationFor each

trained model

Response-Time Threshold

Service Level

Burst Threshold

RTmax

SLmin

U

n = min # backends

Waiting Time Distribution

Execution Time Distribution

Response Time Modeling

Load

SLmin percentile

response time< RTmax?

n = 1 n++

No

Yesx U

Initialization

Retry, as long as not SLA compliant

25

Compute percentile response time for n
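The iterative search for the minimum n can be sketched end to end. The sketch below assumes Poisson arrivals and exponential execution times (M/M/n), so the waiting-time tail comes from the Erlang C formula and the percentile is estimated by Monte Carlo; the paper instead works with measured distributions and a closed-form percentile expression, so treat this as an illustrative stand-in, not Swayam's actual model.

```python
import math
import random

def erlang_c(n, a):
    # Probability an arrival must wait in an M/M/n queue with offered load
    # a = lam/mu (Erlang C formula); requires a < n for stability.
    s = sum(a**k / math.factorial(k) for k in range(n))
    top = (a**n / math.factorial(n)) * n / (n - a)
    return top / (s + top)

def percentile_rt(n, lam, mu, sl_min, samples=50_000, seed=1):
    # Monte Carlo estimate of the SLmin-percentile response time
    # (waiting time + execution time) for n backends.
    a = lam / mu
    if a >= n:
        return float("inf")  # unstable: the queue grows without bound
    pw = erlang_c(n, a)
    rng = random.Random(seed)
    rts = []
    for _ in range(samples):
        # waiting time: exponential tail with rate (n*mu - lam), w.p. pw
        w = rng.expovariate(n * mu - lam) if rng.random() < pw else 0.0
        rts.append(w + rng.expovariate(mu))  # + execution time
    rts.sort()
    return rts[int(sl_min * samples)]

def min_backends(lam, mu, rt_max, sl_min=0.99, burst=2.0):
    # Note: rt_max must exceed the sl_min percentile of the execution time
    # alone (about 4.6/mu here for sl_min = 0.99), or no n can satisfy it.
    lam *= burst  # amplify the load by the burst threshold U
    n = 1         # initialization
    while percentile_rt(n, lam, mu, sl_min) >= rt_max:
        n += 1    # retry, as long as not SLA compliant
    return n
```

With the slide's defaults (SLmin = 99%, U = 2x, RTmax = 5C), `min_backends(lam, mu, 5.0 / mu)` returns the smallest n that is stable and SLA-compliant under the doubled load.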

Page 68: Swayam Framework

Frontends receive the incoming requests; each Swayam instance computes the globally consistent minimum # backends necessary for SLA compliance.

[Figure: backends 1-12 dedicated for the pink model; legend: warm in-use (busy/idle) vs. warm not-in-use]

26
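One way every frontend can act on the same n without coordinating, consistent with the figure showing backends 1..n of the model warm and in use, is to agree on a canonical ordering of the model's backends. This is a hypothetical sketch of that idea, not code from the system:

```python
def backends_in_use(all_backends, n):
    # Every frontend sorts the model's backend IDs the same way and takes
    # the first n, so all frontends agree without any coordination.
    return sorted(all_backends)[:n]

def pick_backend(request_id, in_use):
    # Deterministic dispatch among the in-use backends (illustrative only).
    return in_use[hash(request_id) % len(in_use)]
```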

Page 69:

Outline

1. System architecture, key ideas

2. Analytical model for resource estimation

3. Evaluation results

27

Pages 70-71: Evaluation setup

• Prototype in C++ on top of Apache Thrift
  ➡ 100 backends per service
  ➡ 8 frontends
  ➡ 1 broker
  ➡ 1 server (for simulating the clients)

• Workload
  ➡ 15 production service traces (Microsoft Azure MLaaS)
  ➡ Three-hour traces (request arrival times and computation times)
  ➡ Query computation & model setup times emulated by spinning

28
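Emulating computation "by spinning" means busy-waiting rather than sleeping, so a backend core stays occupied exactly as it would while serving a real query. A minimal sketch of such a spin-wait:

```python
import time

def spin(duration_s):
    # Busy-wait instead of sleeping, so the core stays occupied for the
    # trace-specified computation (or model setup) time.
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        pass  # burn CPU cycles
    return time.perf_counter() - start  # actual elapsed time
```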

Page 72: SLA configuration for each model

• Response-time threshold RTmax = 5C
  ➡ C denotes the mean computation time for the model

• Desired service level SLmin = 99%
  ➡ 99% of the requests must have response times under RTmax

• Burst threshold U = 2x
  ➡ Tolerate an increase in request rate by up to 100%

• Initially, 5 pre-provisioned backends

29

Pages 73-76: Baseline: Clairvoyant Autoscaler (ClairA)

➡ It knows the processing time of each request beforehand
➡ It can travel back in time to provision a backend
➡ "Deadline-driven" approach to minimize resource waste

• ClairA1 assumes zero setup times, immediate scale-ins
  ➡ Reflects the size of the workload

• ClairA2 assumes non-zero setup times, lazy scale-ins
  ➡ Swayam-like

• Both ClairA1 and ClairA2 depend on RTmax, but not on SLmin and U

30
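A ClairA1-style estimate can be sketched as follows. This is a simplified, hypothetical reconstruction of the "deadline-driven" idea (start each request as late as its deadline allows, then count peak concurrency), not the paper's exact algorithm:

```python
def claira1_backends(trace, rt_max):
    # trace: list of (arrival_time, exec_time) pairs, known in advance
    # (the clairvoyant assumption). Zero setup times, immediate scale-ins.
    events = []
    for arrival, exec_time in trace:
        # latest start that still meets the response-time threshold
        start = arrival + max(rt_max - exec_time, 0.0)
        events.append((start, +1))              # backend becomes busy
        events.append((start + exec_time, -1))  # backend becomes free
    events.sort()  # departures (-1) sort before arrivals (+1) at equal times
    peak = cur = 0
    for _, delta in events:
        cur += delta
        peak = max(peak, cur)
    return peak  # peak concurrency = backends needed
```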

Pages 77-85: Resource usage vs. SLA compliance

[Bar chart: normalized resource usage (y-axis 0 to 1.2) for Trace IDs 1-15, comparing ClairA1, ClairA2, and Swayam. Swayam's bars are annotated with its frequency of SLA compliance per trace: 97%, 98%, 64%, 95%, 100%, 100%, 100%, 100%, 100%, 100%, 100%, 87%, 91%, 89%, 97%.]

Swayam performs much better than ClairA2 in terms of resource efficiency.

Swayam is resource efficient, but at the cost of SLA compliance.

On one trace, Swayam seems to perform poorly because the trace is very bursty.

31

Pages 86-89: Summary

• Perfect SLA, irrespective of the input workload, is too expensive
  ➡ in terms of resource usage (as modeled by ClairA)

• To ensure resource efficiency, practical systems
  ➡ need to trade off some SLA compliance
  ➡ while managing client expectations

• Swayam strikes a good balance for MLaaS prediction serving
  ➡ by realizing significant resource savings
  ➡ at the cost of occasional SLA violations

• Easy integration into any existing request-response architecture

32

Page 90:

Thank you. Questions?

33