Swayam: Distributed Autoscaling for Machine Learning as a Service

Arpan Gujarati, Björn B. Brandenburg, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley


Page 1:

Swayam: Distributed Autoscaling for Machine Learning as a Service

Arpan Gujarati, Björn B. Brandenburg, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley


Page 3:

Machine Learning as a Service (MLaaS)

Provider logos: Data Science & Machine Learning, Amazon Machine Learning, Machine Learning, Google Cloud AI

1. Training: Untrained model + Dataset = Trained Model
2. Prediction: Trained Model + Query = Answer

Page 4:

Machine Learning as a Service (MLaaS)

2. Prediction: Trained Model + Query = Answer

Models are already trained and available for prediction (this work).

Page 5:

Swayam: distributed autoscaling, inside the MLaaS infrastructure, of the compute resources needed for prediction serving.

Page 6:

Prediction serving (application perspective)

[Diagram: an application / end user sends an image to the MLaaS provider's image classifier, which returns "cat".]

Page 7:

Prediction serving (provider perspective)

MLaaS provider: lots of trained models! Finite compute resources, the "backends" for prediction.

Page 8:

Prediction serving (provider perspective)

MLaaS provider: lots of trained models, finite compute resources ("backends" for prediction), and multiple request dispatchers ("frontends").

(1) New prediction request for the pink model
(2) A frontend receives the request

Page 9:

Prediction serving (provider perspective)

(1) New prediction request for the pink model
(2) A frontend receives the request
(3) The request is dispatched to an idle backend
(4) The backend fetches the pink model

Page 10:

Prediction serving (provider perspective)

(1) New prediction request for the pink model
(2) A frontend receives the request
(3) The request is dispatched to an idle backend
(4) The backend fetches the pink model
(5) The request outcome is predicted
(6) The response is sent back through the frontend

Page 11:

Prediction serving (objectives)

[Diagram: applications / end users send requests through the frontends (request dispatchers) to the finite backends at the MLaaS provider.]

Page 12:

Prediction serving (objectives)

Provider: resource efficiency. Applications / end users: low latency, SLAs.

Page 13:

Static partitioning of trained models

Page 14:

Static partitioning of trained models: the trained models are partitioned among the finite backends.

Page 15:

Static partitioning of trained models: the trained models are partitioned among the finite backends. No need to fetch and install the pink model.

Page 16:

Static partitioning of trained models

Problem: not all models are used at all times.

Page 17:

Static partitioning of trained models

Problem: not all models are used at all times.
Problem: many more models than backends, with a high memory footprint per model.

Page 18:

Static partitioning of trained models

Problem: not all models are used at all times.
Problem: many more models than backends, with a high memory footprint per model.

Given the objectives (resource efficiency; low latency, SLAs), static partitioning is infeasible.

Page 19:

Classical approach: autoscaling. The number of active backends is automatically scaled up or down based on load.

[Plot: # active backends for the pink model over time, alongside the request load for the pink model.]

Page 20:

Classical approach: autoscaling. The number of active backends is automatically scaled up or down based on load.

[Plot: # active backends for the pink model over time, alongside the request load for the pink model.]

With ideal autoscaling, there are enough backends to guarantee low latency, while the number of active backends over time is minimized for resource efficiency.

Page 21:

Autoscaling for MLaaS is challenging [1/3]

Page 22:

Autoscaling for MLaaS is challenging [1/3]

[Diagram: (4) the backend fetches the pink model; (5) the request outcome is predicted.]

Page 23:

Autoscaling for MLaaS is challenging [1/3]

Challenge: provisioning time (4) (~ a few seconds) >> execution time (5) (~ 10 ms to 500 ms)
Requirement: predictive autoscaling to hide the provisioning latency

Page 24:

Autoscaling for MLaaS is challenging [2/3]

The MLaaS architecture is large-scale and multi-tiered: frontends, backends [VMs, containers], and a hardware broker.

Page 25:

Autoscaling for MLaaS is challenging [2/3]

The MLaaS architecture is large-scale and multi-tiered: frontends, backends [VMs, containers], and a hardware broker.

Challenge: multiple frontends, each with only partial information about the workload
Requirement: fast, coordination-free, globally-consistent autoscaling decisions on the frontends

Page 26:

Autoscaling for MLaaS is challenging [3/3]

Strict, model-specific SLAs on response times:
"99% of requests must complete under 500 ms"
"99.9% of requests must complete under 1 s"
"[A] 95% of requests must complete under 850 ms"
"[B] Tolerate up to a 25% increase in request rates without violating [A]"

Page 27:

Autoscaling for MLaaS is challenging [3/3]

Strict, model-specific SLAs on response times (e.g., "99% of requests must complete under 500 ms").

Challenge: no closed-form solutions to get response-time distributions for SLA-aware autoscaling
Requirement: accurate waiting-time and execution-time distributions

Page 28:

Swayam: model-driven distributed autoscaling

Challenges:
1. Provisioning time (~ a few seconds) >> execution time (~ 10 ms to 500 ms)
2. Multiple frontends with partial information about the workload
3. No closed-form solutions to get response-time distributions for SLA-aware autoscaling

We address these challenges by leveraging specific ML workload characteristics, and design an analytical model for resource estimation that allows distributed and predictive autoscaling.

Page 29:

Outline

1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results

Page 30:

System architecture

Page 31:

System architecture

[Diagram: applications / end users send requests to the frontends; a hardware broker manages a global pool of backends; separate sets of backends are dedicated to the pink, blue, and green models.]

Page 32:

System architecture

Objective: each dedicated set of backends should dynamically scale.
1. If load decreases, extra backends go back to the global pool (for resource efficiency).
2. If load increases, new backends are set up in advance (for SLA compliance).

Page 33:

System architecture: let's focus on the pink model.

Objective: the dedicated set of backends should dynamically scale.
1. If load decreases, extra backends go back to the global pool (for resource efficiency).
2. If load increases, new backends are set up in advance (for SLA compliance).

Page 34:

Key idea 1: Assign states to each backend

Page 35:

Key idea 1: Assign states to each backend

cold: in the global pool
warm: dedicated to a trained model

Page 36:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request
warm not-in-use: hasn't executed a request for a while

Page 37:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request
  busy: executing a request
  idle: waiting for a request
warm not-in-use: hasn't executed a request for a while

Page 38:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request (busy: executing a request; idle: waiting for a request)
warm not-in-use: hasn't executed a request for a while; dedicated, but not used due to reduced load. It can be safely garbage collected (scale-in) ... or easily transitioned to an in-use state (scale-out).

Page 39:

Key idea 1: Assign states to each backend

cold: in the global pool
warm in-use: may be executing a request (busy/idle)
warm not-in-use: can be safely garbage collected (scale-in) or easily transitioned to an in-use state (scale-out)

How do frontends know which dedicated backends to use, and which not to use?
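The state assignments above can be sketched as a small state machine. The states come from the slides; the event names ("dedicate", "dispatch", and so on) are illustrative, since the deck only names the states and the scale-in/scale-out transitions:

```python
from enum import Enum

class BackendState(Enum):
    COLD = "cold"                    # in the global pool
    IDLE = "warm in-use, idle"       # waiting for a request
    BUSY = "warm in-use, busy"       # executing a request
    NOT_IN_USE = "warm not-in-use"   # dedicated, but unused due to reduced load

# Transitions implied by the slides (event names are hypothetical):
TRANSITIONS = {
    (BackendState.COLD, "dedicate"): BackendState.IDLE,         # leave the global pool
    (BackendState.IDLE, "dispatch"): BackendState.BUSY,
    (BackendState.BUSY, "complete"): BackendState.IDLE,
    (BackendState.IDLE, "deactivate"): BackendState.NOT_IN_USE, # load dropped
    (BackendState.NOT_IN_USE, "activate"): BackendState.IDLE,   # scale-out
    (BackendState.NOT_IN_USE, "collect"): BackendState.COLD,    # scale-in (GC)
}

def step(state, event):
    """Advance a backend's state; raises KeyError on an illegal transition."""
    return TRANSITIONS[(state, event)]
```

Note that a not-in-use backend can either be reclaimed by the pool or reactivated cheaply, which is exactly what makes scale-in safe and scale-out fast.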

Page 40:

Key idea 2: Order the dedicated set of backends

[Diagram: backends 1-12 dedicated to the pink model.]

Page 41:

Key idea 2: Order the dedicated set of backends

If 9 backends are sufficient for SLA compliance ...

Page 42:

Key idea 2: Order the dedicated set of backends

If 9 backends are sufficient for SLA compliance, frontends use backends 1-9 (warm in-use, busy/idle) and backends 10-12 transition to the warm not-in-use state.

Page 43:

Key idea 2: Order the dedicated set of backends

If 9 backends are sufficient for SLA compliance, frontends use backends 1-9 (warm in-use, busy/idle) and backends 10-12 transition to the warm not-in-use state.

How do frontends know how many backends are sufficient?
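Because the dedicated backends are globally ordered, any frontend that knows the required count can derive the in-use set on its own. A minimal sketch:

```python
def split_by_order(dedicated, n_required):
    """Split a globally agreed ordering of dedicated backends: the first
    n_required are in use, the rest are not in use. Every frontend that
    computes the same n_required derives the same split, with no
    coordination among frontends."""
    in_use = dedicated[:n_required]       # frontends dispatch only to these
    not_in_use = dedicated[n_required:]   # scale-in candidates
    return in_use, not_in_use
```

For the example above, `split_by_order(list(range(1, 13)), 9)` puts backends 1-9 in use and leaves 10-12 as not-in-use.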

Page 44:

Key idea 3: A Swayam instance on every frontend

Each frontend runs a Swayam instance that observes the incoming requests and computes the globally consistent minimum # backends necessary for SLA compliance.

Page 45:

Outline

1. System architecture, key ideas
2. Analytical model for resource estimation
3. Evaluation results

Page 46:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Page 47:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Inputs: 1. expected request execution time; 2. expected request waiting time; 3. total request load.

Page 48:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Inputs: 1. expected request execution time; 2. expected request waiting time; 3. total request load. All three estimates leverage ML workload characteristics.

Page 49:

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform.

[Histogram: normalized frequency (%) vs. service times (ms) for Trace 1; data from trace, bin width = 10.]

Page 50:

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform. Variation is low:
‣ fixed-size feature vectors
‣ input-independent control flow
‣ non-deterministic machine & OS events are the main sources of variability

[Histogram: normalized frequency (%) vs. service times (ms) for Trace 1; data from trace, bin width = 10.]

Page 51:

Determining expected request execution times

Studied execution traces of 15 popular services hosted on Microsoft Azure's MLaaS platform. Variation is low:
‣ fixed-size feature vectors
‣ input-independent control flow
‣ non-deterministic machine & OS events are the main sources of variability

Execution times are modeled using log-normal distributions.

[Histogram: normalized frequency (%) vs. service times (ms) for Trace 1; data from trace, bin width = 10, with fitted log-normal distribution.]
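Fitting a log-normal to observed service times is straightforward: take logs of the samples, then the maximum-likelihood parameters are the mean and standard deviation of the logged data. A stdlib-only sketch on synthetic data (the trace data itself is not public, so the samples here are drawn from a known log-normal):

```python
import math
import random
import statistics

def fit_lognormal(service_times_ms):
    """MLE fit of a log-normal distribution: mu and sigma are the mean
    and standard deviation of the logged samples."""
    logs = [math.log(t) for t in service_times_ms]
    return statistics.fmean(logs), statistics.pstdev(logs)

# Synthetic service times with median 100 ms and shape parameter 0.25;
# the fit should recover both parameters closely.
rng = random.Random(7)
data = [rng.lognormvariate(math.log(100), 0.25) for _ in range(20000)]
mu, sigma = fit_lognormal(data)
```

With the fitted (mu, sigma) in hand, any execution-time percentile follows directly from the log-normal CDF, which is what makes this model convenient for the response-time analysis later in the deck.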

Page 52:

Determining expected request waiting times

Waiting times depend on the load-balancing (LB) policy.

Page 53:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold.]

Page 54:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global and partitioned scheduling.]

Global and partitioned scheduling perform well, but there are implementation tradeoffs.

Page 55:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global scheduling, partitioned scheduling, and Join-Idle-Queue.]

Global and partitioned scheduling perform well, but there are implementation tradeoffs. JIQ doesn't result in good tail waiting times.

Page 56:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global scheduling, partitioned scheduling, Join-Idle-Queue, and random dispatch.]

Global and partitioned scheduling perform well, but there are implementation tradeoffs. JIQ doesn't result in good tail waiting times. Random dispatch gives much better tail waiting times.

Page 57:

Determining expected request waiting times: load balancing (LB)

[Plot: waiting time (ms) vs. # backends, with a 350 ms threshold, for global scheduling, partitioned scheduling, Join-Idle-Queue, and random dispatch.]

We use an LB policy based on random dispatch!
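Random dispatch needs no shared queue state: each frontend independently picks one of the in-use backends uniformly at random. A minimal sketch (the `Backend` class is an illustrative stand-in, not part of the system described here):

```python
import random

class Backend:
    """Illustrative stand-in for a warm in-use backend."""
    def __init__(self, backend_id):
        self.backend_id = backend_id
        self.queue = []

    def enqueue(self, request):
        self.queue.append(request)

def random_dispatch(request, in_use, rng=random):
    # No coordination between frontends and no per-backend load queries:
    # pick one of the in-use backends uniformly at random and enqueue there.
    backend = rng.choice(in_use)
    backend.enqueue(request)
    return backend
```

Uniform random dispatch makes each backend behave as an independent queue regardless of how many frontends exist, which is what allows every frontend to model the waiting-time distribution analytically on its own.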

Page 58:

Determining the total request load in the near future, to account for high provisioning times

Page 59:

Determining the total request load

Let L be the total request rate and F the total # frontends. Since the hardware broker spreads requests uniformly among the frontends, each frontend sees L' = L/F.

Page 60:

Determining the total request load

Each Swayam instance predicts L' for the near future; how far ahead depends on the time to set up a new backend.

Page 61:

Determining the total request load

Each Swayam instance:
‣ predicts L' for the near future
‣ given F (determined from the broker or through a gossip protocol), computes L = F × L'
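The L = F × L' step is simple arithmetic; the interesting part is predicting L'. As a sketch, the predictor below uses linear extrapolation of the locally observed rate over the backend setup time. The extrapolation method is a hypothetical choice, since the slides only say that L' is predicted for the near future:

```python
def predict_total_load(rate_samples, setup_time_s, num_frontends):
    """Estimate the total near-future load L = F * L' at one frontend.

    rate_samples: [(time_s, local_rate_rps), ...] observed locally (L').
    setup_time_s: time to provision a new backend; we predict this far ahead.
    num_frontends: F, learned from the broker or via a gossip protocol.
    """
    (t0, r0), (t1, r1) = rate_samples[-2], rate_samples[-1]
    slope = (r1 - r0) / (t1 - t0)
    # Project the local rate to the end of the provisioning window.
    local_pred = max(0.0, r1 + slope * setup_time_s)
    # The broker spreads load uniformly, so every frontend scaling its own
    # prediction by F arrives at (approximately) the same total L.
    return num_frontends * local_pred
```

Because all frontends see statistically identical slices of the load, each one can reconstruct the global picture from local observations alone, which is what keeps the autoscaling decision coordination-free.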

Page 62:

Making globally-consistent decisions

What is the minimum # backends required for SLA compliance? (Computed at each frontend, by its Swayam instance.)

Inputs: 1. expected request execution time; 2. expected request waiting time; 3. total request load.

Page 63:

SLA-aware resource estimation

For each trained model, the SLA specifies:
‣ RTmax: response-time threshold
‣ SLmin: service level
‣ U: burst threshold

Output: n = min # backends.

Page 64:

SLA-aware resource estimation

[Flowchart: the load (amplified by U) plus the waiting-time and execution-time distributions feed the response-time model. Starting from n = 1, check whether the SLmin-percentile response time is < RTmax; if not, increment n and retry; if yes, output n = min # backends.]

Page 65:

SLA-aware resource estimation

The waiting-time and execution-time distributions are combined by convolution, yielding a closed-form expression for the percentile response time (see the appendix).

Page 66:

SLA-aware resource estimation

The load fed into the response-time model is amplified based on the burst threshold U.

Page 67: Swayam - microsoft.com · Machine Learning as a Service (MLaaS) + = Trained Model 2. Prediction Query Answer Models are already trained and available for prediction This work 2. Swayam

SLA-aware resource estimationFor each

trained model

Response-Time Threshold

Service Level

Burst Threshold

RTmax

SLmin

U

n = min # backends

Waiting Time Distribution

Execution Time Distribution

Response Time Modeling

Load

SLmin percentile

response time< RTmax?

n = 1 n++

No

Yesx U

Initialization

Retry, as long as not SLA compliant

25

Compute percentile response time for n
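The iterative search for the minimum n can be sketched end to end. The sketch below assumes Poisson arrivals and exponential execution times (M/M/n), so the waiting-time tail comes from the Erlang C formula and the percentile is estimated by Monte Carlo; the paper instead works with measured distributions and a closed-form percentile expression, so treat this as an illustrative stand-in, not Swayam's actual model.

```python
import math
import random

def erlang_c(n, a):
    # Probability an arrival must wait in an M/M/n queue with offered load
    # a = lam/mu (Erlang C formula); requires a < n for stability.
    s = sum(a**k / math.factorial(k) for k in range(n))
    top = (a**n / math.factorial(n)) * n / (n - a)
    return top / (s + top)

def percentile_rt(n, lam, mu, sl_min, samples=50_000, seed=1):
    # Monte Carlo estimate of the SLmin-percentile response time
    # (waiting time + execution time) for n backends.
    a = lam / mu
    if a >= n:
        return float("inf")  # unstable: the queue grows without bound
    pw = erlang_c(n, a)
    rng = random.Random(seed)
    rts = []
    for _ in range(samples):
        # waiting time: exponential tail with rate (n*mu - lam), w.p. pw
        w = rng.expovariate(n * mu - lam) if rng.random() < pw else 0.0
        rts.append(w + rng.expovariate(mu))  # + execution time
    rts.sort()
    return rts[int(sl_min * samples)]

def min_backends(lam, mu, rt_max, sl_min=0.99, burst=2.0):
    # Note: rt_max must exceed the sl_min percentile of the execution time
    # alone (about 4.6/mu here for sl_min = 0.99), or no n can satisfy it.
    lam *= burst  # amplify the load by the burst threshold U
    n = 1         # initialization
    while percentile_rt(n, lam, mu, sl_min) >= rt_max:
        n += 1    # retry, as long as not SLA compliant
    return n
```

With the slide's defaults (SLmin = 99%, U = 2x, RTmax = 5C), `min_backends(lam, mu, 5.0 / mu)` returns the smallest n that is stable and SLA-compliant under the doubled load.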

Page 68: Swayam Framework

Frontends receive the incoming requests; each Swayam instance computes the globally consistent minimum # backends necessary for SLA compliance.

[Figure: backends 1-12 dedicated for the pink model; legend: warm in-use (busy/idle) vs. warm not-in-use]

26
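One way every frontend can act on the same n without coordinating, consistent with the figure showing backends 1..n of the model warm and in use, is to agree on a canonical ordering of the model's backends. This is a hypothetical sketch of that idea, not code from the system:

```python
def backends_in_use(all_backends, n):
    # Every frontend sorts the model's backend IDs the same way and takes
    # the first n, so all frontends agree without any coordination.
    return sorted(all_backends)[:n]

def pick_backend(request_id, in_use):
    # Deterministic dispatch among the in-use backends (illustrative only).
    return in_use[hash(request_id) % len(in_use)]
```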

Page 69:

Outline

1. System architecture, key ideas

2. Analytical model for resource estimation

3. Evaluation results

27

Pages 70-71: Evaluation setup

• Prototype in C++ on top of Apache Thrift
  ➡ 100 backends per service
  ➡ 8 frontends
  ➡ 1 broker
  ➡ 1 server (for simulating the clients)

• Workload
  ➡ 15 production service traces (Microsoft Azure MLaaS)
  ➡ Three-hour traces (request arrival times and computation times)
  ➡ Query computation & model setup times emulated by spinning

28
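Emulating computation "by spinning" means busy-waiting rather than sleeping, so a backend core stays occupied exactly as it would while serving a real query. A minimal sketch of such a spin-wait:

```python
import time

def spin(duration_s):
    # Busy-wait instead of sleeping, so the core stays occupied for the
    # trace-specified computation (or model setup) time.
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        pass  # burn CPU cycles
    return time.perf_counter() - start  # actual elapsed time
```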

Page 72: SLA configuration for each model

• Response-time threshold RTmax = 5C
  ➡ C denotes the mean computation time for the model

• Desired service level SLmin = 99%
  ➡ 99% of the requests must have response times under RTmax

• Burst threshold U = 2x
  ➡ Tolerate an increase in request rate by up to 100%

• Initially, 5 pre-provisioned backends

29

Pages 73-76: Baseline: Clairvoyant Autoscaler (ClairA)

➡ It knows the processing time of each request beforehand
➡ It can travel back in time to provision a backend
➡ "Deadline-driven" approach to minimize resource waste

• ClairA1 assumes zero setup times, immediate scale-ins
  ➡ Reflects the size of the workload

• ClairA2 assumes non-zero setup times, lazy scale-ins
  ➡ Swayam-like

• Both ClairA1 and ClairA2 depend on RTmax, but not on SLmin and U

30
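A ClairA1-style estimate can be sketched as follows. This is a simplified, hypothetical reconstruction of the "deadline-driven" idea (start each request as late as its deadline allows, then count peak concurrency), not the paper's exact algorithm:

```python
def claira1_backends(trace, rt_max):
    # trace: list of (arrival_time, exec_time) pairs, known in advance
    # (the clairvoyant assumption). Zero setup times, immediate scale-ins.
    events = []
    for arrival, exec_time in trace:
        # latest start that still meets the response-time threshold
        start = arrival + max(rt_max - exec_time, 0.0)
        events.append((start, +1))              # backend becomes busy
        events.append((start + exec_time, -1))  # backend becomes free
    events.sort()  # departures (-1) sort before arrivals (+1) at equal times
    peak = cur = 0
    for _, delta in events:
        cur += delta
        peak = max(peak, cur)
    return peak  # peak concurrency = backends needed
```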

Pages 77-85: Resource usage vs. SLA compliance

[Bar chart: normalized resource usage (y-axis 0 to 1.2) for Trace IDs 1-15, comparing ClairA1, ClairA2, and Swayam. Swayam's bars are annotated with its frequency of SLA compliance per trace: 97%, 98%, 64%, 95%, 100%, 100%, 100%, 100%, 100%, 100%, 100%, 87%, 91%, 89%, 97%.]

Swayam performs much better than ClairA2 in terms of resource efficiency.

Swayam is resource efficient, but at the cost of SLA compliance.

On one trace, Swayam seems to perform poorly because the trace is very bursty.

31

Pages 86-89: Summary

• Perfect SLA, irrespective of the input workload, is too expensive
  ➡ in terms of resource usage (as modeled by ClairA)

• To ensure resource efficiency, practical systems
  ➡ need to trade off some SLA compliance
  ➡ while managing client expectations

• Swayam strikes a good balance for MLaaS prediction serving
  ➡ by realizing significant resource savings
  ➡ at the cost of occasional SLA violations

• Easy integration into any existing request-response architecture

32

Page 90:

Thank you. Questions?

33