Hotspot mitigation for the masses Fabien Hermenier, Aditya Ramesh, Abhinay Nagpal, Himanshu Nagpal, Ramesh Chandra ACM Symposium On Cloud Computing 2019


Page 1

Hotspot mitigation for the masses

Fabien Hermenier, Aditya Ramesh, Abhinay Nagpal, Himanshu Nagpal, Ramesh Chandra

ACM Symposium On Cloud Computing 2019

Page 2

Enterprise cloud company

~15,000 customers worldwide

~40,000 private cloud deployments

(and we are recruiting)

Page 3

Private clouds: from converged to hyper-converged infrastructures (HCI)

Converged: SAN based, remote I/Os

HCI: distributed file system favouring local I/Os, one controller VM per node

Page 4

602 private clouds

~4-node clusters, 13 VMs per node (long-tail distribution)

~1.31:1 vCPU/thread, up to 9:1

~25% CPU, ~2% I/Os (dynamic allocation); ~44% memory (static allocation)

small clusters and beefy nodes fit SMB needs

oversubscribed cores

moderate load

no relationship between dimensions

see the distributions in the paper
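The vCPU/thread figure above is an oversubscription ratio. A minimal sketch of how such a ratio could be computed for one cluster follows. The `Node` type, the cluster-wide aggregation and the sample numbers are assumptions made for illustration; the slides do not say how the ratio was aggregated.

```python
from dataclasses import dataclass


@dataclass
class Node:
    threads: int          # hardware threads exposed by the node
    vcpus_allocated: int  # sum of the vCPUs of the VMs hosted on the node


def vcpu_per_thread(nodes):
    """Cluster-wide vCPU-to-thread ratio; above 1.0 means cores are oversubscribed."""
    total_threads = sum(n.threads for n in nodes)
    total_vcpus = sum(n.vcpus_allocated for n in nodes)
    return total_vcpus / total_threads


# Illustrative 4-node cluster (made-up numbers, not taken from the dataset).
cluster = [Node(32, 40), Node(32, 44), Node(32, 38), Node(32, 46)]
ratio = vcpu_per_thread(cluster)
print(f"{ratio:.2f}:1")
```

A ratio above 1:1 only states that more vCPUs are allocated than threads exist; it says nothing about actual CPU usage, which the slide reports separately.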

Page 5

Fix hotspots induced by dynamic resource allocation

Cron based or threshold based

NP-hard, no holy grail

Scheduler specialisation may alter its applicability

[Figure: per-node cpu and mem usage]

Acropolis Dynamic Scheduler (ADS)
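A threshold-based trigger can be sketched as follows. This is only an illustration of the general idea: the function name, the data layout and the threshold values are hypothetical, and the slides do not describe how ADS detects hotspots.

```python
def find_hotspots(usage, thresholds):
    """Return {node: [dimensions over threshold]} for every node that is a hotspot.

    usage:      {node: {dimension: utilisation in [0, 1]}}
    thresholds: {dimension: maximum tolerated utilisation} (hypothetical values)
    """
    hotspots = {}
    for node, dims in usage.items():
        # A dimension without a threshold is never considered a hotspot.
        over = [d for d, value in dims.items() if value > thresholds.get(d, 1.0)]
        if over:
            hotspots[node] = sorted(over)
    return hotspots


usage = {
    "n1": {"cpu": 0.92, "mem": 0.40},
    "n2": {"cpu": 0.30, "mem": 0.55},
    "n3": {"cpu": 0.50, "mem": 0.60},
}
thresholds = {"cpu": 0.85, "mem": 0.95}  # assumed values for the example
hotspots = find_hotspots(usage, thresholds)
print(hotspots)
```

Here only `n1` is reported, because its CPU usage exceeds the assumed 85% threshold. A cron-based approach would run the same kind of check on a fixed schedule instead of reacting to a threshold crossing.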

Page 6

Doing great for the 1%

Page 7

Established in 2009

Tech unicorn in 2013

~3,700 employees

Doing ok for the 99%

Page 8

Inside ADS

Exact approach on top of BtrPlace

Constraint programming backend to avoid over-filtering

Objective: minimise data movement, tend to balance

Actuation: VM migrations (up to 2 in parallel), admin notification when no solution is found
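To make the objective concrete, here is a toy exact search over a single resource. It is a sketch under stated assumptions: brute-force enumeration stands in for the constraint programming backend, none of the names below belong to the BtrPlace API, and the parallel migration limit is not modelled. It returns the plan that moves the least data, breaks ties in favour of the most balanced placement, and returns `None` when no placement removes the hotspot.

```python
from itertools import product


def plan_migrations(vms, placement, capacity):
    """Exhaustive search for the cheapest hotspot-free placement.

    vms:       {vm: (cpu_demand, data_size)}  (hypothetical model)
    placement: {vm: current node}
    capacity:  {node: maximum cpu load tolerated on the node}
    Returns {vm: (source, destination)} or None if no solution exists.
    """
    names = sorted(vms)
    nodes = sorted(capacity)
    best = None
    for choice in product(nodes, repeat=len(names)):
        load = dict.fromkeys(nodes, 0.0)
        for vm, node in zip(names, choice):
            load[node] += vms[vm][0]
        if any(load[n] > capacity[n] for n in nodes):
            continue  # a node is still overloaded: not a valid plan
        moved = sum(vms[vm][1] for vm, node in zip(names, choice)
                    if placement[vm] != node)
        # Primary criterion: data moved. Secondary: highest node utilisation.
        key = (moved, max(load[n] / capacity[n] for n in nodes))
        if best is None or key < best[0]:
            best = (key, dict(zip(names, choice)))
    if best is None:
        return None  # this is where an admin notification would be raised
    target = best[1]
    return {vm: (placement[vm], target[vm])
            for vm in names if placement[vm] != target[vm]}


vms = {"a": (60, 100), "b": (50, 20), "c": (20, 50)}
placement = {"a": "n1", "b": "n1", "c": "n2"}
plan = plan_migrations(vms, placement, {"n1": 100, "n2": 100})
print(plan)
```

In this example `n1` starts at a load of 110 for a capacity of 100, and moving `b` is the cheapest fix because it carries the least data. The enumeration is exponential in the number of VMs, which is why a real solver is needed beyond toy sizes.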

Page 9

Lessons learnt

Looking at 2,668 clusters that called ADS at least once

Page 10

Service latency is good enough

0.5% undecidable problems

Working with an exact approach

Continuous search helps yield better mitigation plans

de-facto sizing limit

Scale beyond sizing limits

[Figure: duration (sec.) at the 90th to 99th percentiles; series: first solution, last solution, latency]

[Figure: CCDF of saved migrations (%)]

[Figure: duration (sec.) against cluster size (nodes), from 4 to 256 nodes]

In the paper: engineering particularities

Page 11

[Figure: scatter plot annotated "new feature"]

Optimise to reduce undecidable rate, migrations

Beware of false quick wins

The dataset bias dilemma

Looking for workload-agnostic optimisations


Chasing outliers requires trade-offs

Page 12

Low overall load, local hotspots.

Manage only the VMs that are supposedly misplaced

Pin the "well-placed" VMs

Local search to reduce the problem size

Available in BtrPlace

Enabled in ADS 1.0 during the prototyping phase

Page 13

Local search considered useful and harmful

Over-filtering issues reported

Moved to a 2-phase resolution

[Figure: improvement wrt. full resolution (%) in latency, migrations and solved problems, for three strategies: retry without local search on timeout, retry without local search if unsolvable, pure local search. Values as extracted, one group per strategy: 62.99, 1.38, 0.3; 34.41, 1.61, 0.22; 32.26, 1.62, −2.34]

Local search enabled, then disabled if needed

Trigger reconsidered over time
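The 2-phase resolution can be sketched as follows, assuming a toy single-resource model. Phase 1 only lets the supposedly misplaced VMs move and pins the others; phase 2 retries with every VM movable when phase 1 finds no solution. The solver, the heuristic that marks VMs as misplaced and all names are hypothetical, and only the "retry if unsolvable" trigger is shown, not the retry on timeout.

```python
from itertools import product


def solve(vms, placement, capacity, managed):
    """Exhaustive search that may only move the VMs in `managed`.

    vms: {vm: demand}; placement: {vm: node}; capacity: {node: limit}.
    Returns {vm: destination} with the fewest moves, or None if unsolvable.
    """
    movable = sorted(managed)
    nodes = sorted(capacity)
    best = None
    for choice in product(nodes, repeat=len(movable)):
        target = dict(placement)          # pinned VMs keep their node
        target.update(zip(movable, choice))
        load = dict.fromkeys(nodes, 0)
        for vm, node in target.items():
            load[node] += vms[vm]
        if any(load[n] > capacity[n] for n in nodes):
            continue
        moves = {vm: target[vm] for vm in movable if target[vm] != placement[vm]}
        if best is None or len(moves) < len(best):
            best = moves
    return best


def solve_two_phase(vms, placement, capacity):
    load = dict.fromkeys(capacity, 0)
    for vm, node in placement.items():
        load[node] += vms[vm]
    # Assumed heuristic: VMs hosted on overloaded nodes are the misplaced ones.
    misplaced = {vm for vm, node in placement.items() if load[node] > capacity[node]}
    plan = solve(vms, placement, capacity, misplaced)   # phase 1: local search
    if plan is not None:
        return plan, "local search"
    # Phase 2: over-filtering suspected, retry with every VM movable.
    return solve(vms, placement, capacity, set(vms)), "full resolution"


capacity = {"n1": 100, "n2": 100}
vms = {"a": 70, "b": 40, "c": 60, "d": 30}
placement = {"a": "n1", "b": "n1", "c": "n2", "d": "n2"}
plan, phase = solve_two_phase(vms, placement, capacity)
print(plan, phase)
```

In this example neither `a` nor `b` fits on `n2` as long as `c` and `d` are pinned, so local search alone reports an unsolvable problem. The full resolution succeeds by swapping `b` and `d`, which illustrates how pinning can over-filter.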

Page 14

Practical effectiveness

How many clusters are in a clean state after a call to ADS?

73.28% if ADS issues a plan

12.24% if unsolvable

Complex to analyse without A/B testing

The success rate is a consequence of subjective modelling choices

Page 15

Conclusion

It is about supporting diverse workloads

Not all enhancements are safe

Tools and knowledge bases are crucial

Incremental improvements from observation: small wins matter

Trading quality for capability

It is not about developing a new feature, it is about checking its side effects

Exhibit and characterise outliers

It is about preventing regressions

We are hiring: https://www.nutanix.com/careers