K. Ravindran, A. Adiththan & M. Iannelli
Department of Computer Science
City University of New York [CUNY – City College & Graduate Center]
** A part of this work was supported by AFRL-VFRP funding and by a Cisco student internship during Summer 2013
On-the-fly Measurements to Evaluate SLA in Distributed Cloud Services
Outline of presentation
• Intrinsic complexity of cloud-based distributed services (resource uncertainties, lack of trust, difficulty of measurements, . . .)
• Hostile external environment (failures, attacks, cheating)
• Certification of cloud services vis-à-vis non-functional goals
• Service-level metrics to benchmark cloud providers: elasticity, linearity, isolation, robustness, . . .
• Case study: Replicated task manager on clouds
Issues due to third-party control of the cloud infrastructure
1. Less-than-promised VM cycles
2. Less-than-promised VM storage/memory
   (due to the “statistical resource sharing” business model adopted by the cloud provider, similar to how banks operate)
3. Absence of mechanisms to secure VMs against threats
4. Large VM replenishment time when VMs crash
5. Lax integrity of the data stored on VMs
6. Lax integrity of the algorithmic software processes running on VMs

The above issues adversely impact the quality of high-level services and applications running on the cloud.
Trust relationship between sub-systems [ref: general treatise by A. Avizienis, B. Randell, et al.,
IEEE Transactions on Dependable and Secure Computing (2004)]
[Figure: layered trust relationships in a cloud-based system]
- Layers: cloud-based system infrastructure (IaaS), service-layer algorithms (PaaS), client applications (SaaS).
- SaaS ↔ PaaS: QoS specs and SLA for tasks flow down. How good is the QoS enforcement by an algorithm? Is client task-generation compliant with the specs?
- PaaS ↔ IaaS: expected guarantees (VM cycles, network BW, . .). How well are the guarantees enforced? Probabilistic guarantees??
- IaaS questions: Does the cloud infrastructure insulate the VMs and network paths from attacks and failures?? How aggressive is the VM multiplexing on physical nodes?

SLA ≡ [promised QoS, penalty for non-compliance]
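The slide's definition SLA ≡ [promised QoS, penalty for non-compliance] can be sketched as a small check. The linear penalty model, the latency metric, and all names below are illustrative assumptions, not part of the talk:

```python
def sla_penalty(promised_latency_ms, observed_latency_ms, penalty_per_ms=0.01):
    """Illustrative SLA = (promised QoS, penalty for non-compliance).
    Returns 0.0 if the promised latency is met, else a penalty
    proportional to the size of the violation (an assumed linear model)."""
    violation = observed_latency_ms - promised_latency_ms
    return 0.0 if violation <= 0 else violation * penalty_per_ms

print(sla_penalty(600, 550))  # compliant -> 0.0
print(sla_penalty(600, 750))  # 150 ms over the promise -> 1.5
```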
Architecture for SLA evaluation (an example of data replication)

[Figure: three-tier architecture]
- CLIENT APPLICATIONS issue server task requests over a service interface. QoS specs flow down (response time desired: T ≤ Δ; fault-tolerance desired: fm, . .), and the actual QoS observed flows back (response time; fault-tolerance). Data plane and control plane are separated.
- DATA SERVICE (provider of value-added service): a server replication control algorithm manages server redundancy, security, . . over an infrastructure interface. It leases K = 6 VMs from the cloud and instantiates servers on Nv = 3 of them, with K > Nv ≥ 2fm + 1 (assuming no byzantine failures).
- CLOUD INFRASTRUCTURE (provider of raw service): manages resources (VMs, storage, . .). Idle VM nodes have available cycles Vc [hard to measure].
- X marks an attack on a VM. Difficult-to-measure parameters: # of VMs suffering failure (fa = 2 in the figure); VM failure probability q.
- A log of meta-data for QoS auditing feeds a trusted, neutral auditor H, which reasons about how well the QoS obligations are met.
Parameter dependencies between “data service” metrics and cloud infrastructure

Service metric           | Parameters used in mapping to cloud infrastructure
1. Data access latency   | # of VMs hosting servers, VM cycles, network BW, replication strategy
2. Data coherence        | strategy to time-stamp data & determine global virtual time, data update frequency, VM failure rate
3. Performance stability | VM cycles, degree of VM multiplexing
4. Data availability     | # of VMs leased, VM failure rate, VM downtime, replication strategy

Often, some form of closed-form mapping relation exists that allows a coarse estimate of the service metrics from the infrastructure parameters
è The mapping function is specific to the algorithm employed
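One closed-form mapping of the kind described is the availability of an fm-fault-tolerant replicated service as a function of the per-VM failure probability q. The independence assumption and the at-most-fm-failures availability criterion are modeling assumptions for this sketch:

```python
from math import comb

def data_availability(n_v, f_m, q):
    """Probability that at most f_m of the n_v replica VMs fail,
    assuming independent VM failures with probability q each.
    This is a coarse closed-form estimate of the 'data availability'
    service metric from infrastructure parameters (n_v, q)."""
    return sum(comb(n_v, i) * q**i * (1 - q)**(n_v - i) for i in range(f_m + 1))

# e.g. Nv = 5 replicas tolerating fm = 2 failures, VM failure probability q = 0.1
print(round(data_availability(5, 2, 0.1), 5))  # -> 0.99144
```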
Example end-to-end path: host -> R -> R -> R -> host
path availability ζ = (1 - r)^3, where r is the failure probability of a router R
Service robustness vs service resilience: QoS-based view

[Figure: QoS level over TIME for a cloud of VM nodes with storage]
- Quantities: prescription of desired QoS (q); measured QoS (q'); minimum acceptable QoS (q''); QoS error ε = q - q'; steady-state error limit εt; sustainable QoS error margin ε1; max. allowed error margin εm.
- Starting from the initial configuration, a VM attack (1 VM) is handled by replacing the failed VM, giving full service recovery with ε = ε0 [ε0 < εt] è SERVICE ROBUSTNESS.
- A later VM attack (2 VMs) is only partially handled (replace 1 of the 2 failed VMs), giving partial service recovery with ε = ε1 [εt < ε1 < εm] è SERVICE RESILIENCE.
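The robustness/resilience distinction above can be sketched as a classifier on the QoS error ε against the two thresholds εt and εm (the "non-compliant" label for ε ≥ εm is an assumed name for the remaining case):

```python
def classify_qos(q, q_measured, eps_t, eps_m):
    """Classify service behavior from the QoS error eps = q - q_measured.
    Per the slide: eps < eps_t (steady-state limit) -> service robustness;
    eps_t <= eps < eps_m (max. allowed margin) -> service resilience;
    otherwise the error margin is exceeded (label assumed here)."""
    eps = q - q_measured
    if eps < eps_t:
        return "robust"
    if eps < eps_m:
        return "resilient"
    return "non-compliant"

print(classify_qos(1.0, 0.95, 0.1, 0.3))  # eps = 0.05 -> "robust"
print(classify_qos(1.0, 0.80, 0.1, 0.3))  # eps = 0.20 -> "resilient"
```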
Service elasticity, linearity, and isolation

[Figure: k clients invoking a CLOUD SERVICE realized by a software algorithm running on N VMs; the VM instances execute on a physical machine of capacity C; N ≥ 1, k >> N]
- λi: service request rate desired by the i-th client; λ'i: service request rate sustainable by the i-th client.
- Combined VM load offered on the CSP: (λ1 + λ2 + . . + λk); combined VM load sustained by the CSP: (λ'1 + λ'2 + . . + λ'k).
- 0 < λ'i < λi < α·C, where α is the fractional VM capacity promised by a CSP (0 < α < 1).
- For k = kx clients, the capacity experienced is C'kx < α·C: the VM capacity experienced by a client is less than the promised level (a drop in VM cycles experienced by the i-th client when requesting λ0 cycles).
- λ'i = βx·λi (0 < βx < 1.0), with dλ'i/dλi > 0: elasticity.
- For a more aggressive CSP y with k = ky clients (ky > kx), the experienced capacity C'''ky is lower still: CSP y is more aggressive than CSP x w.r.t. statistical sharing of VM cycles.
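The degradation λ'i = β·λi can be sketched as follows. The equal-scaling (fair-sharing) assumption, under which β is the same for every client once combined demand exceeds the multiplexed budget α·C, is an assumption of this sketch, not a claim about any particular CSP:

```python
def sustained_rates(requested, alpha, capacity):
    """Sustained per-client request rates when k clients share a CSP.
    If the combined offered load fits within the multiplexed budget
    alpha * capacity, each client gets its requested rate (beta = 1);
    otherwise every client is scaled by the same beta < 1, so that
    lambda'_i = beta * lambda_i (fair-sharing assumption)."""
    budget = alpha * capacity
    beta = min(1.0, budget / sum(requested))
    return [beta * lam for lam in requested]

# three clients on a machine with C = 100 and promised fraction alpha = 0.5
print(sustained_rates([10, 20, 30], 0.5, 100))  # demand 60 > budget 50 -> beta = 5/6
```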
Measurement methodology
IaaS parameters (i.e., VM cycles, storage speed/capacity, . . .) are difficult to measure, but PaaS-layer output parameters are generally easy to measure. [E.g., when hiring a taxi service to the airport, it is hard to measure the driving speed s of the taxi, but the time T to reach the airport is easily measurable. If d is the published distance to the airport, then the driving speed may be inferred as s = d/T.]

We believe that a closed-form mapping relationship exists between the PaaS-layer outputs O and the IaaS parameters par:
O = F(par, E) è par = F⁻¹(O, E)

[Figure: the IaaS layer (resource allocation [Vc, Dc, . .]) and the PaaS layer (runs the service-level algorithm) are abstracted as a black-box, realizable with a closed-form mapping procedure F. The black-box maps the parameters par to the service-level output O under environment conditions E*; E is the set of environment parameters known to the designer [⊆ E*].]
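The taxi analogy is the simplest instance of the inverse mapping par = F⁻¹(O, E): infer a hard-to-measure IaaS-level parameter from an easily measured PaaS-level output:

```python
def infer_speed(distance_km, observed_time_h):
    """Taxi analogy from the talk: the driving speed s (an internal,
    hard-to-measure parameter) is inferred from the published distance d
    and the easily measured trip time T by inverting T = d/s -> s = d/T."""
    return distance_km / observed_time_h

print(infer_speed(30.0, 0.5))  # 30 km in half an hour -> 60.0 km/h
```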
Prototype system results (on UNIX-based LAN)
fm: max. # of VM failures assumed by the replication protocol; fa: actual # of VM failures; K: total VM pool size (N ≤ K); W: promised network bandwidth; Vc: promised VM cycles; k: # of instances of the replication algorithm running concurrently (k ≥ 1); 1 ≤ fm < ⌈N/2⌉; 0 ≤ fa ≤ K.

[Plot 1: time-to-complete query experienced by a client, TTC (in msec, 250–1250), vs. # of VMs participating in data replication (N = 3–9, per instance of the replication algorithm)]
- “o” points: TTC when a single instance of the replication algorithm uses the VM cycles & network bandwidth; “x” points: TTC when 5 instances of the replication algorithm share the resources (i.e., k = 5).
- δ: increase in TTC due to resource sharing with other instances [a change in fm from 2 to 3 for a given (N, W, Vc) incurs a lower δ, where N ≥ 7; δ(2) and δ(3) marked at k = 5].
- A delayed query result (TTC > Δ, with Δ = 600 msec, say) is less useful è reduces “service availability”.

[Plot 2: maximum per-client query completion rate (λ per sec, 2–8) vs. # of concurrent instances of the replication algorithm sharing VM cycles and network bandwidth (k = 1–7); N = 8, W = 300 kbps, Vc = 1.2 GHz; each of the 8 replica nodes runs k instances of the voter task]
- Curves for fm = 2, fa = 0 and fm = 3, fa = 0, each under (a) no reduction in Vc except due to statistical mux of VM cycles, and (b) a 20% reduction in Vc due to the propensity of the CSP to cheat on VM cycles.
- The plots depict cases of non-isolation experienced.

Experimental results on replicated image data processing on clouds
Our model-based approach to SLA compliance checking

[Figure: a composite system S, i.e., a cloud-based system implementing the core services and adaptation-control functions of the application, operating in a hostile external environment E*]
- True model: O* = G*(I, s*, E*), with state s*; computational model of S programmed in the controller: O = G(I, s, E).
- QoS control actions (I) go in; the parameters of the actual QoS (O) & environment (E) are observed.
- A third-party verification mechanism reasons about the system behavior by online and offline testing methods.
Two main issues in design of cloud-based network systems

1. System certification
How well a cloud-based network system S behaves relative to what S is supposed to provide to applications.
Existing efforts on verification of distributed algorithms (such as state-space analysis) have focused primarily on the functional requirements of S. E.g., does a data transfer protocol X (which is embodied in S) reliably transfer data packets in the presence of network loss?
Certification involves analyzing the non-functional attributes of S (say, how fast S reacts to VM resource outages in the cloud, how stable the QoS output is, . . .)
-- How can the compliance of S relative to a prescribed specs of S be ascertained? Service auditing by a trusted third-party entity??
2. Autonomic system composition
How can the system S dynamically adjust its behavior to optimally respond to a failure? (fight-through capability of S)
--- Dynamic composition of modules involves software-engineering challenges
--- Need to employ service-oriented architectures
Our research goal
Design of certifiable cloud-based distributed systems where the adaptation processes can be externally managed and reasoned about
A cloud-based network system S that cannot be certified about service quality is useless when S is deployed to meet mission-critical needs
--- even if S has been designed to function well
Being good is one thing, but being verifiably good is another [e.g., a student X with a GPA of 3.5 (say) is preferred for employment over a student Y who is more knowledgeable than X but does not possess a GPA certificate]
We assign a score to each system that advertises a service to clients [the score is in the range (0.0, 1.0): 1.0(-) è best; 0(+) è worst]. Systems are compared, based on scores, for risk assessment during deployments (scores enable measurable ways of system improvement)
DATA service domain: a web service X offers 90% availability under a 10% failure probability of the VMs hosting the web servlets, whereas a service Y offers only 85% availability under the same fault scenarios [the buyer of the DATA service makes a purchase decision based on the rating]
Key ideas in our approach
1. Formulate a model-based engineering approach to determine the reasons for SLA violations
2. Machine-intelligence is employed in a system auditor tool to reason about the service compliance to a reference behavior (e.g., PO-MDP techniques)
3. Inject simulated attacks and stressor events to study the service resilience in various operating conditions
Typical QoS metrics: latency, availability, consistency
Goal: non-intrusive monitoring of cloud services by the auditor
Modeling of external environment

QoS specs q, algorithm parameters par, and the system resource allocation R are usually controllable inputs. In contrast, environment parameters e ∈ E* are often uncontrollable and/or unobservable, but they do impact the service-level performance (e.g., component failures, network traffic fluctuations, attacks, etc.)

environment parameter space: E* = E(yk) ∪ E(nk) ∪ E(ck)
- E(yk): parameters that the designer knows about
- E(nk): parameters that the designer does not currently know about
- E(ck): parameters that the designer can never know about

Algorithm design decisions face this uncertainty --- so, the designer makes certain assumptions about the environment (e.g., no more than 2 nodes will fail during execution of a distributed algorithm). When the assumptions get violated, say, due to attacks, algorithms fall short of what they are designed to achieve
è Evaluate how well an algorithm performs under strenuous conditions
Control-theoretic view of QoS adaptation in a cloud service

[Figure: control-theoretic loop to realize the QoS-to-resource mapping for an ADAPTIVE APPLICATION]
- The cloud-based service-infrastructure (storage & processing, network connects) exposes a service interface INT; the system state φ is visible at INT.
- A Controller (with the infrastructure model programmed into it) computes resource adjustments as corrective actions; an Observer performs the state-to-QoS mapping.
- Signals: reference QoS Pref (input); actual QoS experienced by the application (output); achieved QoS P' (steady-state); QoS tracking error ε = Pref - P'.
- The incidence of hostile environment conditions E* (partial knowledge about failures/attacks) perturbs the service-support system S.
- A LOG of [ε, Pref, φ, E] feeds an intelligent management entity, the QoS auditor H, which reasons about the application behavior (resilience, performance, robustness, . .).
- ε is a measure of how trustworthy the system is.
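The loop above can be sketched as a simple proportional controller. The gain, the toy resource-to-QoS model, and all names are illustrative assumptions; the slide only prescribes the structure (compute ε = Pref - P', adjust resources):

```python
def control_step(p_ref, p_achieved, resources, gain=0.5):
    """One iteration of the QoS adaptation loop: compute the tracking
    error eps = Pref - P' and adjust the resource allocation
    proportionally (a simple P-controller; the gain is an assumption)."""
    eps = p_ref - p_achieved
    return eps, resources + gain * eps

# Toy infrastructure model (assumption): achieved QoS P' = min(Pref, resources/10)
resources, p_ref = 4.0, 1.0
for _ in range(20):
    p_achieved = min(p_ref, resources / 10)
    eps, resources = control_step(p_ref, p_achieved, resources)
print(round(eps, 3))  # tracking error shrinks toward 0 as resources grow
```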
Certifying non-functional attributes of system behavior

[Figure: a distributed software system S, comprising adaptation processes and core-functional elements (algorithms, cloud resources, . .), serves an APPLICATION-LEVEL USER under the incidence of an uncontrolled environment E; a management entity H monitors the signal flows]
- H evaluates the functional behavior [FUNC(S), Pref, Pact] and the para-functional behavior [PARA(S), Pref, Pact, E] from the input reference Pref, the actual output Pact, and trigger events.
- H emits certify(S) ∈ {100%-good, 90%-good, . ., bad, . . .}.

Axioms of system observation:
[quality(S)=good] AND [certify(S)=good] è [accuracy(H)=high];
[quality(S)=good] AND [certify(S)=bad] è [accuracy(H)=low];
[quality(S)=bad] AND [certify(S)=bad] è [accuracy(H)=high];
[quality(S)=bad] AND [certify(S)=good] è [accuracy(H)=low];
[quality(S)=80%-good] AND [certify(S)=90%-good] è [accuracy(H)=medium].
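The four boolean axioms of system observation can be written directly as a lookup from (true quality, auditor's verdict) to the auditor's accuracy; the graded cases (e.g., 80%-good vs 90%-good giving medium accuracy) are omitted from this sketch:

```python
# The four boolean axioms of system observation from the slide.
ACCURACY = {
    ("good", "good"): "high",
    ("good", "bad"):  "low",
    ("bad",  "bad"):  "high",
    ("bad",  "good"): "low",
}

def auditor_accuracy(quality, certification):
    """Accuracy of auditor H given the true quality of S and H's verdict
    certify(S). Agreement implies high accuracy; disagreement implies low."""
    return ACCURACY[(quality, certification)]

print(auditor_accuracy("good", "bad"))  # a good system certified bad -> "low"
```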
Future research plans
• Injection of attack and stressor events on cloud-based system being tested
• Incorporation of system utility functions and SLA penalty as part of dependability analysis of cloud systems
• Identification of probabilistic measures of system quality
• Machine-intelligence and Markov decision processes for system analysis
• Cyber-physical systems methods for autonomic system improvement