
On-the-fly Measurements to Evaluate SLA in Distributed Cloud Services

K. Ravindran, A. Adithathan & M. Iannelli
Department of Computer Science
City University of New York [CUNY – City College & Graduate Center]

** Part of this work was supported by AFRL-VFRP funding and by a CISCO student internship during Summer 2013

Outline of presentation

•  Intrinsic complexity of cloud-based distributed services (resource uncertainties, lack of trust, difficulty of measurements. . .)

• Hostile external environment (failures, attacks, cheating)

• Certification of cloud services vis-à-vis non-functional goals

• Service-level metrics to benchmark cloud providers: elasticity, linearity, isolation, robustness, . . .

• Case study: Replicated task manager on clouds

Issues due to third-party control of the cloud infrastructure

1. Less-than-promised VM cycles
2. Less-than-promised VM storage/memory
(due to the “statistical resource sharing” based business model adopted by the cloud provider, similar to how banks operate)
3. Absence of mechanisms to secure VMs against threats
4. Large VM replenishment time when VMs crash
5. Lax integrity of the data stored on VMs
6. Lax integrity of the algorithmic software processes running on VMs

The above issues adversely impact the quality of high-level services and applications running on the cloud.

Trust relationship between sub-systems [ref: general treatise by A. Avizienis, B. Randell, et al., IEEE Transactions on Dependable and Secure Computing (2004)]

[Figure: layered trust relationships in a cloud-based system: the infrastructure (IaaS), the service-layer algorithms (PaaS), and the client applications (SaaS). The client layer passes QoS specs and an SLA for its tasks to the service layer, which in turn expects guarantees (VM cycles, network BW, . .) from the infrastructure. The questions at each interface are: how good is the QoS enforcement by an algorithm? How well are the expected guarantees enforced? Does the cloud infrastructure insulate the VMs and network paths from attacks and failures? How aggressive is the VM multiplexing on physical nodes? Are the guarantees only probabilistic? Is the client task-generation compliant with the specs?]

SLA ≡ [promised QoS, penalty for non-compliance]

Architecture for SLA evaluation (an example of data replication)

[Figure: client applications submit server task requests to a DATA SERVICE (the provider of a value-added service) built on the CLOUD INFRASTRUCTURE (the provider of the raw service, managing resources: VM, storage, . .). A server replication control algorithm handles server instantiation on VMs (# of VMs running a server: Nv = 3) out of K = 6 VMs leased from the cloud, where K > Nv ≥ 2fm+1 (assuming no byzantine failures); the infrastructure interface covers management of server redundancy, security, . . The data plane carries query traffic across the service interface, while the control plane carries the QoS specs (response time desired: T ≤ Δ; fault-tolerance desired, fm) and the actual QoS observed (response time; fault-tolerance). A log of meta-data for QoS auditing feeds a trusted, neutral auditor H, which reasons about how well the QoS obligations are met. Attacks (X) may strike VMs; difficult-to-measure parameters include the # of VMs suffering failure (fa = 2), the VM failure probability q, and the available cycles Vc of an idle VM node.]
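As a worked illustration of the sizing constraint K > Nv ≥ 2fm+1 above, the sketch below checks whether a leased VM pool can host a given replication configuration; the function names, the fm values, and the spare-VM heuristic in min_pool_size are our own illustrative choices, not part of the prototype.

```python
def feasible_replication(K: int, Nv: int, fm: int) -> bool:
    """Check the sizing constraint K > Nv >= 2*fm + 1 used by the replicated
    data service (assuming no byzantine failures)."""
    return fm >= 1 and K > Nv >= 2 * fm + 1

def min_pool_size(fm: int) -> int:
    """Smallest pool K that fits Nv = 2*fm + 1 replicas plus one idle spare
    to replace a crashed VM (the spare is our assumption)."""
    return (2 * fm + 1) + 1

if __name__ == "__main__":
    # K = 6 and Nv = 3 are taken from the figure; the fm values are illustrative.
    print(feasible_replication(K=6, Nv=3, fm=1))   # True:  3 >= 2*1 + 1
    print(feasible_replication(K=6, Nv=3, fm=2))   # False: 3 <  2*2 + 1
    print(min_pool_size(fm=2))                     # 6
```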

Parameter dependencies between “data service” metrics and cloud infrastructure

Service metrics and the parameters used in mapping to the cloud infrastructure:
1. Data access latency: # of VMs hosting servers, VM cycles, network BW, replication strategy
2. Data coherence: strategy to time-stamp data & determine global virtual time, data update frequency, VM failure rate
3. Performance stability: VM cycles, degree of VM multiplexing
4. Data availability: # of VMs leased, VM failure rate, VM downtime, replication strategy
. . . .

Often, some form of closed-form mapping relation exists that allows a coarse estimate of the service metrics from the infrastructure parameters.

⇒ The mapping function is specific to the algorithm employed.

Example end-to-end path: host → R → R → R → host, where r is the failure probability of a router R. The path availability is ζ = (1 - r)^3.
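As a minimal sketch of such a forward mapping (only the ζ = (1 - r)^3 relation comes from the example above; the function name and the sample value of r are ours):

```python
def path_availability(r: float, n_routers: int = 3) -> float:
    """Closed-form mapping from an infrastructure parameter (per-router
    failure probability r) to a service metric (end-to-end path availability):
    the path is up only if every router on it is up."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("failure probability must lie in [0, 1]")
    return (1.0 - r) ** n_routers

if __name__ == "__main__":
    # For the 3-router path of the example with r = 0.05, zeta = 0.95^3 ~ 0.857.
    print(round(path_availability(0.05), 3))
```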

Service robustness vs service resilience: QoS-based view

[Figure: timeline of the QoS level delivered by a cloud of VM nodes with storage, relative to the prescription of the desired QoS (q). The QoS error is ε = q - q', where q' is the measured QoS; εt is the steady-state error limit, ε1 the sustainable QoS error margin, and εm the max. allowed error margin. In the initial configuration ε = ε0. After an attack on 1 VM, the failed VM is replaced and full service recovery follows; after an attack on 2 VMs, only 1 of the 2 failed VMs is replaced, giving partial service recovery at the minimum acceptable QoS (q'') with ε = ε1.]

ε = ε0 with [ε0 < εt] ⇒ SERVICE ROBUSTNESS
ε = ε1 with [εt < ε1 < εm] ⇒ SERVICE RESILIENCE
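A minimal sketch of this QoS-based classification, assuming the thresholds εt and εm are given and that the comparison is done on a single scalar QoS metric; the function name, labels, and numeric values are ours:

```python
def classify_qos_error(q: float, q_measured: float,
                       eps_t: float, eps_m: float) -> str:
    """Classify the service from the QoS error eps = q - q'.

    eps <= eps_t         -> "robust"    (error within the steady-state limit)
    eps_t < eps <= eps_m -> "resilient" (degraded, but within the max. allowed margin)
    eps > eps_m          -> "violation"
    """
    eps = q - q_measured
    if eps <= eps_t:
        return "robust"
    if eps <= eps_m:
        return "resilient"
    return "violation"

if __name__ == "__main__":
    # Prescribed QoS q = 1.0 (normalized); eps_t = 0.05 and eps_m = 0.25 are illustrative.
    print(classify_qos_error(1.0, 0.97, 0.05, 0.25))   # robust    (eps = 0.03)
    print(classify_qos_error(1.0, 0.85, 0.05, 0.25))   # resilient (eps = 0.15)
    print(classify_qos_error(1.0, 0.60, 0.05, 0.25))   # violation (eps = 0.40)
```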

Service elasticity, linearity, and isolation

[Figure: clients 1, 2, 3, . ., k invoke a CLOUD SERVICE realized by a software algorithm running on N VMs (N ≥ 1; k >> N); the VM instances execute on a physical machine of capacity C. The combined VM load offered on the CSP is (λ1 + λ2 + . . + λk), and the combined VM load sustained by the CSP is (λ'1 + λ'2 + . . + λ'k). A plot of the rate sustainable by the i-th client (λ'i) against the rate requested by the i-th client (λi) shows the promised level α.C, the capacity C' actually experienced for k = kx clients, and a still lower capacity C''' for k = ky clients (ky > kx) under a more aggressive CSP y; the gap at a requested rate λ0 is the drop in VM cycles experienced by the i-th client.]

• λi: service request rate desired by the i-th client; λ'i: service request rate sustainable by the i-th client
• 0 < λ'i < λi < α.C, where α is the fractional VM capacity promised by a CSP (0 < α < 1) and C is the physical machine capacity
• C' (for k = kx) < α.C ⇒ the VM capacity experienced by a client is less than the promised level
• λ'i = βx.λi (0 < βx < 1.0), with dλ'i/dλi > 0 ⇒ elasticity
• CSP y is more aggressive than CSP x w.r.t. statistical sharing of VM cycles ⇒ a still lower sustained capacity C''' for k = ky
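A rough sketch of how βx and the elasticity condition dλ'i/dλi > 0 could be estimated from per-client measurements of offered vs. sustained request rates; the fitting procedure, function names, and sample numbers are our own illustrative choices:

```python
from typing import List, Tuple

def fit_beta(samples: List[Tuple[float, float]]) -> float:
    """Least-squares estimate of beta_x in lambda'_i = beta_x * lambda_i,
    from (offered_rate, sustained_rate) samples of one client."""
    num = sum(lam * lam_s for lam, lam_s in samples)
    den = sum(lam * lam for lam, _ in samples)
    return num / den

def is_elastic(samples: List[Tuple[float, float]]) -> bool:
    """Crude check of d(lambda'_i)/d(lambda_i) > 0: the sustained rate keeps
    rising as the offered rate rises."""
    pts = sorted(samples)
    return all(s2 > s1 for (_, s1), (_, s2) in zip(pts, pts[1:]))

if __name__ == "__main__":
    # Hypothetical measurements: sustained rate roughly 0.8x the offered rate.
    obs = [(2.0, 1.6), (4.0, 3.1), (6.0, 4.9), (8.0, 6.3)]
    print(round(fit_beta(obs), 2))   # ~0.8
    print(is_elastic(obs))           # True
```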

Measurement methodology

IaaS parameters (i.e., VM cycles, storage speed/capacity, . . .) are difficult to measure, but PaaS-layer output parameters are generally easy to measure. [E.g., when hiring a taxi service to the airport, it is hard to measure the driving speed s of the taxi, but the time T to reach the airport is easily measurable. If d is the published distance to the airport, then the driving speed may be inferred as s = d/T.]

We believe that a closed-form mapping relationship exists between the PaaS-layer outputs O and the IaaS parameters par: O = F(par, E) ⇒ par = F⁻¹(O, E)

[Figure: the IaaS layer (resource allocation [Vc, Dc, …]) and the PaaS layer (which runs the service-level algorithm) are abstracted as a black-box, realizable with a closed-form mapping procedure F: the inputs are the parameters par and the environment conditions E*, and the output is the service-level output O. E denotes the set of environment parameters known to the designer [⊆ E*].]
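A minimal sketch of that black-box view, using the taxi analogy from the previous slide: a forward map F predicts the easily measurable output from a hidden parameter, and a generic numerical inversion recovers the parameter from the observed output. The bisection routine and all names/values are ours, not part of the authors' methodology:

```python
def invert(F, observed, lo, hi, tol=1e-6):
    """Numerically invert a monotone forward map O = F(par) over [lo, hi] by
    bisection, returning par such that F(par) is approximately equal to observed."""
    increasing = F(hi) > F(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (F(mid) < observed) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Taxi analogy: with published distance d, the observable trip time is
    # T = F(s) = d / s, and the hidden parameter is the driving speed s.
    d = 18.0                        # km (illustrative value)
    F = lambda s: d / s             # forward map: speed -> trip time
    T_observed = 0.30               # hours
    s = invert(F, T_observed, lo=1.0, hi=200.0)
    print(round(s, 2))              # ~60.0 km/h, i.e., s = d / T
```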

Experimental results on replicated image data processing on clouds: prototype system results (on a UNIX-based LAN)

• fm: max. # of VM failures assumed by the replication protocol; fa: actual # of VM failures; 1 ≤ fm < ⌈N/2⌉; 0 ≤ fa ≤ K
• K: total VM pool size (N ≤ K)
• TTC (in msec): time-to-complete a query experienced by a client
• W: promised network bandwidth; Vc: promised VM cycles
• TTC is measured both when a single instance of the replication algorithm uses the VM cycles & network bandwidth and when 5 instances of the replication algorithm share the resources (i.e., k = 5)
• δ: increase in TTC due to resource sharing with other instances [a change in fm from 2 to 3 for a given (N, W, Vc) incurs a lower δ, where N ≥ 7]
• A delayed query result (TTC > Δ) is less useful ⇒ reduces “service availability”

[Figure 1: TTC (in msec, 250–1250) vs. the # of VMs participating in data replication, N = 3–10, per instance of the replication algorithm. Curves for a single instance and for k = 5 concurrent instances (k: # of instances of the replication algorithm running concurrently, k ≥ 1); δ(2) and δ(3) mark the increase in TTC for fm = 2 and fm = 3, against a deadline of Δ = 600 msec (say).]

[Figure 2: maximum per-client query completion rate (λ per sec, 2–8) vs. the # of concurrent instances of the replication algorithm sharing VM cycles and network bandwidth, k = 1–9, with N = 8, W = 300 kbps, Vc = 1.2 GHz; each of the 8 replica nodes runs k instances of the voter task. Curves for (fm = 2, fa = 0) and (fm = 3, fa = 0), each plotted with no reduction in Vc (except due to statistical mux of VM cycles) and with a 20% reduction in Vc due to the propensity of the CSP to cheat on VM cycles; the plot depicts the cases of non-isolation experienced.]
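Given a per-query TTC log, the notions used above can be estimated directly: “service availability” as the fraction of queries meeting the deadline Δ, and δ as the increase in mean TTC between a single-instance run and a shared (k > 1) run. The sketch below uses made-up sample values and our own function names:

```python
from statistics import mean
from typing import Sequence

def service_availability(ttc_ms: Sequence[float], delta_ms: float) -> float:
    """Fraction of queries whose time-to-complete meets the deadline (TTC <= Delta)."""
    return sum(t <= delta_ms for t in ttc_ms) / len(ttc_ms)

def sharing_penalty(ttc_single: Sequence[float], ttc_shared: Sequence[float]) -> float:
    """delta: increase in mean TTC when k instances share VM cycles and bandwidth,
    relative to a single-instance run."""
    return mean(ttc_shared) - mean(ttc_single)

if __name__ == "__main__":
    delta = 600.0                                  # msec, the deadline used in the plots
    ttc_k1 = [310.0, 340.0, 295.0, 330.0]          # single instance (made-up values)
    ttc_k5 = [520.0, 610.0, 585.0, 700.0]          # k = 5 concurrent instances (made-up values)
    print(service_availability(ttc_k5, delta))     # 0.5
    print(round(sharing_penalty(ttc_k1, ttc_k5)))  # 285 msec
```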

Our model-based approach to SLA compliance checking

[Figure: the composite system S, with state s*, comprises the cloud-based system implementing the core services and the adaptation control functions of the application, operating under a hostile external environment E*. The controller applies QoS control actions (I) and observes the parameters of the actual QoS (O) and the environment (E). The true model of S is O* = G*(I, s*, E*), while the computational model of S programmed into the controller is O = G(I, s, E). A third-party verification mechanism reasons about the system behavior by online and offline testing methods.]

Two main issues in the design of cloud-based network systems

1. System certification: how well a cloud-based network system S behaves relative to what S is supposed to provide to applications. Existing efforts on verification of distributed algorithms (such as state-space analysis) have focused primarily on the functional requirements of S. E.g., does a data transfer protocol X (which is embodied in S) reliably transfer data packets in the presence of network loss? Certification, in contrast, involves analyzing the non-functional attributes of S (say, how fast S reacts to VM resource outages in the cloud, how stable the QoS output is, . . .). How can the compliance of S relative to prescribed specs of S be ascertained? Service auditing by a trusted third-party entity??

2. Autonomic system composition: how can the system S dynamically adjust its behavior to optimally respond to a failure (the fight-through capability of S)? Dynamic composition of modules involves software-engineering challenges and the need to employ service-oriented architectures.

Our research goal

Design of certifiable cloud-based distributed systems where the adaptation processes can be externally managed and reasoned about

A cloud-based network system S that cannot be certified for service quality is useless when S is deployed to meet mission-critical needs, even if S has been designed to function well.

Being good is one thing, but being verifiably good is another [e.g., a student X with a GPA of 3.5 (say) is preferred for employment over a student Y who is more knowledgeable than X but does not possess a GPA certificate].

We assign a score to each system that advertises a service to clients [the score is in the range (0.0, 1.0): 1.0(-) ⇒ best; 0+ ⇒ worst]. Systems are compared, based on their scores, for risk assessment during deployments (scores enable measurable ways of system improvement).

DATA service domain: a web service X offers 90% availability under a 10% failure probability of the VMs hosting the web servlets, whereas a service Y offers only 85% availability under the same fault scenarios [the buyer of the DATA service makes a purchase decision based on the rating].
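A hedged sketch of such a rating: if VMs fail independently with probability q and a service is counted as available whenever at least one of its n replica VMs is up, its availability score is 1 - q^n. The independence assumption, the replica count, and the function names are ours; only the q = 0.10 and 90%/85% figures come from the example above:

```python
def availability_score(q: float, n_replicas: int) -> float:
    """Availability of a service hosted on n replica VMs, assuming independent
    VM failures with probability q and that the service is up whenever at
    least one replica is up."""
    return 1.0 - q ** n_replicas

def pick_provider(scores: dict) -> str:
    """Risk assessment: pick the advertised service with the highest score."""
    return max(scores, key=scores.get)

if __name__ == "__main__":
    q = 0.10                                        # VM failure probability (from the example)
    x_score = availability_score(q, n_replicas=1)   # 0.90, matching service X's rating
    y_score = 0.85                                  # advertised availability of service Y
    print(pick_provider({"X": x_score, "Y": y_score}))   # X: the buyer decides on the rating
```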

Key ideas in our approach

Goal: non-intrusive monitoring of cloud services by the auditor. Typical QoS metrics: latency, availability, consistency.

1. Formulate a model-based engineering approach to determine the reasons for SLA violations.
2. Employ machine intelligence in a system auditor tool to reason about the service compliance to a reference behavior (e.g., PO-MDP techniques).
3. Inject simulated attacks and stressor events to study the service resilience in various operating conditions.

Modeling of the external environment

QoS specs q, algorithm parameters par, and the system resource allocation R are usually controllable inputs. In contrast, environment parameters e ∈ E* are often uncontrollable and/or unobservable, but they do impact the service-level performance (e.g., component failures, network traffic fluctuations, attacks, etc.).

Environment parameter space: E* = E(yk) ∪ E(nk) ∪ E(ck), where E(yk) contains the parameters that the designer knows about, E(nk) the parameters that the designer does not currently know about, and E(ck) the parameters that the designer can never know about.

Algorithm design decisions face this uncertainty, so the designer makes certain assumptions about the environment (e.g., no more than 2 nodes will fail during execution of a distributed algorithm). When the assumptions get violated, say, due to attacks, algorithms fall short of what they are designed to achieve.

⇒ Evaluate how well an algorithm performs under strenuous conditions

Control-theoretic view of QoS adaptation in a cloud service

[Figure: a control-theoretic loop realizes the QoS-to-resource mapping for an ADAPTIVE APPLICATION running on the cloud-based service-infrastructure (storage & processing, network connects), i.e., the service-support system S, exposed at the service interface INT. Given the reference QoS Pref (input) and the incidence of a hostile environment condition E*, the Controller (with the infrastructure model programmed into it) computes resource adjustments and issues corrective actions; the Observer performs the state-to-QoS mapping from the system state φ visible at INT to the actual QoS experienced by the application (output). The achieved QoS in steady-state is P', and ε = Pref - P' is the QoS tracking error. A LOG of [ε, Pref, φ, E] is handed to the QoS auditor H, an intelligent management entity (with partial knowledge about failures/attacks) that reasons about the application behavior (resilience, performance, robustness, . .). ε is a measure of how trustworthy the system is.]
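A minimal sketch of that loop, with a toy proportional adjustment rule standing in for the infrastructure model programmed into the controller; the class name, the gain value, and the sample observations are our own assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class QoSControlLoop:
    p_ref: float                 # reference QoS Pref
    resources: float             # current resource allocation
    gain: float = 0.5            # toy proportional gain (our assumption)
    log: List[Tuple[float, float, float, str]] = field(default_factory=list)

    def step(self, p_achieved: float, state_phi: float, env: str) -> float:
        """One control iteration: observe P', compute eps = Pref - P', apply a
        corrective resource adjustment, and log [eps, Pref, phi, E] for the auditor H."""
        eps = self.p_ref - p_achieved
        self.resources += self.gain * eps          # corrective action
        self.log.append((eps, self.p_ref, state_phi, env))
        return self.resources

if __name__ == "__main__":
    loop = QoSControlLoop(p_ref=0.95, resources=4.0)
    # Hypothetical observations of the achieved QoS P' under a changing environment.
    for p_prime, phi, env in [(0.80, 0.6, "normal"), (0.88, 0.7, "normal"), (0.93, 0.8, "attack")]:
        loop.step(p_prime, phi, env)
    print(loop.log[-1])          # the last [eps, Pref, phi, E] record handed to the auditor H
```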

Certifying non-functional attributes of system behavior

[Figure: the distributed software system S comprises adaptation processes and core-functional elements (algorithms, cloud resources, . .), connected by signal flows. The APPLICATION-LEVEL USER supplies the input reference Pref and receives the actual output Pact, under the incidence of the uncontrolled environment E. A management entity H monitors S and, when triggered, evaluates the functional behavior [FUNC(S), Pref, Pact] and the para-functional behavior [PARA(S), Pref, Pact, E], producing certify(S) ∈ {100%-good, 90%-good, . ., bad, …}.]

Axioms of system observation:
[quality(S)=good] AND [certify(S)=good] ⇒ [accuracy(H)=high];
[quality(S)=good] AND [certify(S)=bad] ⇒ [accuracy(H)=low];
[quality(S)=bad] AND [certify(S)=bad] ⇒ [accuracy(H)=high];
[quality(S)=bad] AND [certify(S)=good] ⇒ [accuracy(H)=low];
[quality(S)=80%-good] AND [certify(S)=90%-good] ⇒ [accuracy(H)=medium].
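A compact rendering of these axioms, assuming quality(S) and certify(S) are both expressed as fractions in [0, 1] and that the auditor's accuracy is graded by the gap between them; the numeric thresholds are our own illustrative choice:

```python
def auditor_accuracy(quality_s: float, certify_s: float) -> str:
    """Accuracy of the auditor H from the gap between the true quality of S
    and the certification it issued (both as fractions in [0, 1])."""
    gap = abs(quality_s - certify_s)
    if gap <= 0.05:
        return "high"        # e.g., a good system certified good, a bad one certified bad
    if gap <= 0.15:
        return "medium"      # e.g., an 80%-good system certified 90%-good
    return "low"             # e.g., a good system certified bad, or vice versa

if __name__ == "__main__":
    print(auditor_accuracy(1.0, 1.0))   # high
    print(auditor_accuracy(0.8, 0.9))   # medium
    print(auditor_accuracy(1.0, 0.0))   # low
```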

Future research plans

•  Injection of attack and stressor events on cloud-based system being tested

•  Incorporation of system utility functions and SLA penalty as part of dependability analysis of cloud systems

•  Identification of probabilistic measures of system quality

•  Machine-intelligence and Markov decision processes for system analysis

•  Cyber-physical systems methods for autonomic system improvement