Latency-aware Elastic Scaling for Distributed Data Stream Processing Systems
DESCRIPTION
Elastic scaling allows a data stream processing system to react to a dynamically changing query or event workload by automatically scaling in or out. This way, both unpredictable load peaks and underload situations can be handled. However, each scaling decision comes with a latency penalty due to the required operator movements. In practice, an elastic system may therefore improve system utilization, but it cannot provide the latency guarantees defined by a service level agreement (SLA). In this paper we introduce an elastic scaling system that optimizes utilization under latency constraints defined by an SLA. Specifically, we present a model that estimates the latency spike created by a set of operator movements. We use this model to build a latency-aware elastic operator placement algorithm that minimizes the number of latency violations. We show that our solution reduces the 90th percentile of the end-to-end latency by up to 30% and the number of latency violations by 50%. The system utilization achieved by our approach is comparable to a scaling strategy that does not use latency as an optimization target.
TRANSCRIPT
Public
Latency-aware Elastic Scaling of Distributed
Data Stream Processing Systems
Thomas Heinze, Zbigniew Jerzak, Gregor Hackenbroich, Christof Fetzer
May 27, 2014
Utilization in Cloud Environments
Twitter's cluster [1] has an average CPU utilization below 20%, yet ~80% of the resources are reserved
The Google cluster trace [2] shows an average CPU utilization of 25-35% and an average memory utilization of 40%
The average utilization in public clouds is estimated to be between 6% and 12% [3].
[1] Benjamin Hindman, et al. "Mesos: A platform for fine-grained resource sharing in the data center." NSDI, 2011.
[2] Charles Reiss, et al. "Heterogeneity and dynamicity of clouds at scale: Google trace analysis." SoCC, 2012.
[3] Arunchandar Vasan, et al. "Worth their watts? An empirical study of datacenter servers." HPCA, 2010.
Elasticity
Users need to reserve required resources:
Limited understanding of the performance of the system
Limited knowledge of the characteristics of the workload
[Figure: workload over time vs. provisioned load. Static provisioning alternates between overprovisioning and underprovisioning; elastic provisioning follows the workload.]
Elastic Data Stream Processing[1,2]
Long-standing continuous queries over potentially infinite data streams
Small memory footprint (MB–GB) for most use cases, fast scale-out
Strict requirements on end-to-end latency
Unpredictable workload with high variability (within seconds to minutes)
Load balancing influences running queries
[Figure: a data stream processing engine consuming input streams and producing output streams]
[1] V. Gulisano, et al. "StreamCloud: An Elastic and Scalable Data Streaming System." IEEE TPDS, 2012.
[2] R. Fernandez, et al. "Integrating scale out and fault tolerance in stream processing using operator state management." SIGMOD, 2013.
Impact on End-to-End Latency
[Figure: two time series over ~400 time units. Left: number of used hosts and query count. Right: end-to-end latency (s) and number of moved operators; latency spikes coincide with operator movements.]
Outline
1. Introduction
2. Movement Cost Modeling
3. Latency-aware Elastic Scaling
4. Evaluation
5. Conclusion and Future Work
Operator Movement Protocol
Pause & Resume protocol [1]
[Figure: operators A1 and F1, together with A1's state, are moved across Hosts 1-4 using the pause & resume protocol.]
[1] M. Shah, et al. "Flux: An adaptive partitioning operator for continuous query systems." ICDE, 2003.
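The pause & resume steps can be sketched as follows. This is an illustrative skeleton, not the engine's actual API: the `Operator` class, the host dictionaries, and the function name are all stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    state: dict = field(default_factory=dict)
    paused: bool = False

def pause_and_resume_move(op, upstream_ops, src_host, dst_host):
    """Move `op` from `src_host` to `dst_host` via pause & resume (sketch)."""
    # 1. Pause all upstream operators so no new tuples reach `op`;
    #    tuples produced meanwhile are buffered on the upstream side.
    for up in upstream_ops:
        up.paused = True
    # 2. Extract the operator's state and install it on the target host.
    moved = Operator(op.name, state=dict(op.state))
    dst_host[op.name] = moved
    del src_host[op.name]
    # 3. Resume the upstream operators; the buffered tuples drain to the
    #    new location, which is what creates the latency spike.
    for up in upstream_ops:
        up.paused = False
    return moved
```

The pause duration in step 1 is exactly what the movement cost model on the next slide estimates.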
Movement Cost Model
ql(op) = pauseTime(op) × inputRate(op, t)
pauseTime(op) = max{ moveTime(o) | o ∈ succ(op), o ∈ ops_moved }
moveTime(o) = f(stateSize(o), |ops_moved|, load(h, t))
delay_proc(op) = 0.5 × ql(op) × procTime(op, t)
delay_arrival(op) = 0.5 × ql(op) / inputRate(op, t)
latSpike(op) = pauseTime(op) + delay_proc(op) − delay_arrival(op)
lat(q_i, t+1) = lat(q_i, t) + Σ_{op ∈ pausedOps(q_i)} latSpike(op)
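A minimal sketch of the per-operator spike estimate, with the function name ours and the arrival-delay term read as a division by the input rate so that every term is in seconds (ql is a tuple count):

```python
def estimate_latency_spike(pause_time, input_rate, proc_time):
    """Estimate the latency spike of one paused operator (model sketch).

    pause_time: seconds the operator is paused (driven by the move times
    of the moved successors); input_rate: tuples/second;
    proc_time: processing time per tuple in seconds.
    """
    ql = pause_time * input_rate           # tuples queued while paused
    delay_proc = 0.5 * ql * proc_time      # avg. extra processing delay
    delay_arrival = 0.5 * ql / input_rate  # avg. time a queued tuple has
                                           # already waited (= pause/2)
    return pause_time + delay_proc - delay_arrival

# e.g. a 2 s pause at 10 tuples/s and 50 ms per tuple:
# ql = 20 tuples, delay_proc = 0.5 s, delay_arrival = 1.0 s, spike = 1.5 s
```

Summing these spikes over all paused operators of a query gives its predicted latency at t+1.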
Validation
Estimated latency peak for 378 scaling decisions, grouped into three classes:
small: 0.0 s < lat < 5.0 s
medium: 5.0 s < lat < 10.0 s
large: lat > 10.0 s

                     Measured
Estimated    Small    Medium    Large
  Small       68%       0         0
  Medium      19%       7%        0
  Large        2%       1%        3%

High precision and no underestimation.
LATENCY-AWARE ELASTIC SCALING
Latency-aware Elastic Scaling
User-defined threshold for the maximum latency
Differentiate between necessary and optional scaling decisions:
Necessary: need to respond to overloaded hosts
Optional: a host release can be postponed
General idea: choose scaling decisions whose movement cost is below the latency threshold
When should we scale?
Upper threshold: "If the CPU utilization of a host is larger than x for y seconds, the host is marked as overloaded."
Lower threshold: "If the average CPU utilization of all hosts is smaller than z for w seconds, a host is marked as underloaded."
Additional parameters: target utilization, grace period
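The two threshold rules can be sketched as below; function names, the sample layout, and the idea of counting the window in sampling intervals are our illustration, not the system's actual interface:

```python
def overloaded_hosts(samples, upper, window):
    """samples: {host: [CPU utilization per interval, newest last]}.
    A host is overloaded when every sample in its last `window`
    intervals exceeds `upper` (the "larger than x for y seconds" rule)."""
    return [h for h, s in samples.items()
            if len(s) >= window and all(u > upper for u in s[-window:])]

def system_underloaded(samples, lower, window):
    """Underloaded when the cluster-wide average utilization stayed below
    `lower` for the last `window` intervals (the "smaller than z for w
    seconds" rule); a host may then be released."""
    series = list(samples.values())
    if not series or any(len(s) < window for s in series):
        return False
    return all(sum(s[i] for s in series) / len(series) < lower
               for i in range(-window, 0))
```

Requiring the condition to hold over a whole window, plus a grace period after each scaling action, prevents reacting to short utilization spikes.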
Latency-aware Scaling
Scale-out decisions:
- Move a subset of operators to remove the overload (subset problem)
- Best solution: the set of operators creating the minimal latency peak
Scale-in decisions:
- Release the host with the minimal latency peak, if possible
- If the estimated latency exceeds the threshold: move away a subset of operators from the host with the minimal peak
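The scale-out subset problem can be sketched with a brute-force search; the tuple layout and the summing of per-operator spikes are simplifying assumptions of ours (the model aggregates spikes per query), not the paper's exact algorithm:

```python
from itertools import combinations

def pick_move_set(operators, overload, threshold):
    """operators: (name, cpu_load, estimated_spike) tuples on the
    overloaded host. Return the subset whose combined load removes
    `overload` with the smallest estimated latency peak, or None if
    every sufficient subset exceeds the latency `threshold`."""
    best = None
    for r in range(1, len(operators) + 1):
        for subset in combinations(operators, r):
            if sum(load for _, load, _ in subset) < overload:
                continue  # subset does not resolve the overload
            spike = sum(s for _, _, s in subset)  # pessimistic aggregate
            if spike <= threshold and (best is None or spike < best[0]):
                best = (spike, list(subset))
    return None if best is None else best[1]
```

Returning None corresponds to postponing an optional decision; a necessary decision (an overloaded host) would still have to be executed with the cheapest available subset.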
EVALUATION
System Architecture [1]
[Figure: several elastic CEP engines form the data stream processing engine; an operator placement component coordinates the processing of input streams into output streams.]
[1] Thomas Heinze, et al. "Auto-scaling techniques for Elastic Data Stream Processing." SMDB, 2014.
Setup
Private cloud environment with 10 hosts
Financial workload taken from the Frankfurt Stock Exchange
Varying query workload with up to 35 queries
Characteristics such as CPU load and latency are measured in 10-second intervals
Latency is evaluated via the number of violations (> 2, 3, 4, 5 sec.)
Two baseline algorithms: state size heuristic, CPU load heuristic
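The violation metric reduces to counting measurement intervals above each threshold; a minimal sketch (function name ours):

```python
def count_violations(latencies, thresholds=(2.0, 3.0, 4.0, 5.0)):
    """Count measurement intervals whose end-to-end latency (in seconds)
    exceeds each violation threshold."""
    return {t: sum(1 for lat in latencies if lat > t) for t in thresholds}

# e.g. count_violations([1.0, 2.5, 4.5, 6.0])
# yields {2.0: 3, 3.0: 2, 4.0: 2, 5.0: 1}
```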
Different Latency Thresholds
[Figure: left: average latency and min/avg/max utilization for latency thresholds of 2, 3, and 4 sec. Right: number of violations (2, 3, 4, 5 sec.) per latency threshold.]
The latency threshold is reflected in the measured latency.
Comparison with Baseline
[Figure: left: number of violations (2, 3, 4, 5 sec.) for our model vs. the StateSize and CpuLoad selection strategies. Right: average latency and min/avg/max utilization per strategy.]
Our model outperforms state-of-the-art solutions.
Different Utilization Thresholds
[Figure: left: number of violations (2, 3, 4, 5 sec.) for lower thresholds of 0.3, 0.4, and 0.5. Right: average latency and min/avg/max utilization per lower threshold.]
Our model slows down overly aggressive scaling policies.
Summary
Elastic scaling of a data stream processing system has negative effects on end-to-end latency
We introduced a model for estimating operator movement cost
Latency-aware elastic scaling significantly reduces the number of latency violations (up to 50% fewer)
Future Work:
Latency violations due to overload remain; employ pro-active scaling strategies