
Comprehensive Depiction of Configuration-dependent Performance Anomalies in Distributed Server Systems
Christopher Stewart, Ming Zhong, Kai Shen, and Thomas O’Neill
University of Rochester
Presented at the 2nd USENIX Workshop on Hot Topics in System Dependability


Page 1

Comprehensive Depiction of Configuration-dependent Performance Anomalies in Distributed Server Systems
Christopher Stewart, Ming Zhong, Kai Shen, and Thomas O’Neill
University of Rochester
Presented at the 2nd USENIX Workshop on Hot Topics in System Dependability

Page 2

Context

- Distributed server systems. Example: J2EE application servers.
- Many system configurations: switches that control runtime execution.
- A wide range of workload conditions: exogenous demands for system resources.

Example runtime conditions in J2EE:
- System configurations: concurrency limit, component placement
- Workload conditions: request rate

Page 3

Presumptions

- Performance expectations based on knowledge of system design are reasonable:
  - Lead developers: high-level algorithms
  - Administrators: day-to-day experience

Example expectation, Little's Law: the average number of requests in the system equals the average arrival rate times the average service time.
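Little's Law can be sanity-checked with a few illustrative numbers (the figures below are examples, not measurements from the paper):

```python
# Little's Law: L = lambda * W, where
#   L      = average number of requests in the system
#   lambda = average arrival rate (requests/second)
#   W      = average time a request spends in the system (seconds)

def littles_law_occupancy(arrival_rate, residence_time):
    """Expected average number of in-flight requests."""
    return arrival_rate * residence_time

# Illustrative numbers: 40 req/s arriving, 0.25 s per request
print(littles_law_occupancy(40.0, 0.25))  # 10.0 requests in flight on average
```

If a measured server holds far more requests than this expectation predicts, that gap is exactly the kind of expectation error the rest of the talk turns into an anomaly signal.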

Page 4

Problem Statement

- Dependable performance is important for system management: QoS scheduling, SLA negotiations.
- Performance anomalies (runtime conditions in which performance falls below expectations) are not uncommon.

[Figure: Real performance anomalies. Throughput, actual vs. expectation, across component placement strategies; the gaps between the two curves are anomalies.]

Page 5

Goals

- Previous work: anomaly characterization can aid the debugging process and guide online avoidance [AGU-SOSP99, QUI-SOSP05, CHE-NSDI04, COH-SOSP05, KEL-WORLDS05].
  - These efforts focused on specific runtime conditions (e.g., those encountered during a particular execution).
- We wish to depict all anomalous conditions.
- Comprehensive depictions can:
  - Aid the debugging of production systems before distribution
  - Enable preemptive avoidance of anomalies in live systems

Page 6

Approach

Our depictions are derived in a three-step process:
1. Generate performance expectations by building a comprehensive whole-system performance model
2. Search for anomalous runtime conditions
3. Extrapolate a comprehensive anomaly depiction

Challenges:
- The model must consider a wide range of system configurations
- A systematic method is needed to determine the anomaly error threshold
- An appropriate method is needed to detect correlations between runtime conditions and anomalies

Page 7

Outline

- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / Conclusion

Page 8

Comprehensive Performance Expectations

- Modeling the configuration space is hard:
  - Configurations have complex effects on performance
  - Considering a wide range of configurations increases model complexity
- Our modeling methodology:
  - Build performance models as a hierarchy of sub-models
  - Sub-models can be independently adjusted to consider new system configurations

Page 9

Rules for Our Sub-Model Hierarchies

- The output of each sub-model is a workload property.
  - Workload property: an internal demand for resources (e.g., CPU consumption)
- The inputs to each sub-model are either (1) workload properties or (2) system configuration settings.
- Sub-models at the highest level produce performance expectations.
- Workload properties at the lowest level, called canonical workload properties, can be measured independent of system configurations.
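These rules can be sketched in code. The following is a hypothetical miniature of such a hierarchy; the function names, formulas, and numbers are illustrative assumptions, not the paper's actual sub-models:

```python
# Hypothetical miniature of a sub-model hierarchy (illustrative only; these
# are not the paper's actual sub-models or formulas).

def per_component_cpu(canonical_cpu, cache_coherence_overhead):
    """Sub-model: average per-request CPU usage at each component.
    Inputs: a canonical workload property and a configuration setting."""
    return {c: cpu * (1.0 + cache_coherence_overhead)
            for c, cpu in canonical_cpu.items()}

def per_machine_cpu(component_cpu, placement):
    """Sub-model: average per-request CPU usage at each machine, given a
    component placement configuration (machine -> list of components)."""
    return {m: sum(component_cpu[c] for c in comps)
            for m, comps in placement.items()}

def throughput_expectation(machine_cpu):
    """Top-level sub-model: the most CPU-loaded machine bounds throughput."""
    return 1.0 / max(machine_cpu.values())

# Canonical workload property: per-component CPU seconds per request,
# measurable independent of any system configuration.
canonical = {"c1": 0.010, "c2": 0.020, "c3": 0.005}
comp_cpu = per_component_cpu(canonical, cache_coherence_overhead=0.10)
mach_cpu = per_machine_cpu(comp_cpu, {"node1": ["c1", "c2"], "node2": ["c3"]})
print(throughput_expectation(mach_cpu))  # bottleneck-bound requests/second
```

The point of the layering is the one the slide makes: swapping in a new configuration (say, a different placement) only touches the sub-model that consumes that setting, not the canonical measurements below it.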

Page 10

A Hierarchy of Sub-Models

- We leverage the workload properties of earlier work [STE-NSDI05].
- Advantage: sub-models have meaning.
- Limitation: configuration dependencies may make sub-models complex.

[Figure: Hierarchy of sub-models for J2EE application servers. Canonical workload properties (per-component CPU usage and communication need without caching) combine with the cache coherence configuration in Sub-models 1 and 2 (average request CPU usage and communication need at each component). With the component placement and remote invocation method configurations, these feed Sub-models 3 and 4 (average request CPU usage and communication need at each machine), which combine with the service concurrency level configuration in Sub-model 5 (average request response time) and Sub-model 6 (system throughput).]

Page 11

Outline

- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / Conclusion

Page 12

Determination of the Anomaly Error Threshold

- Slight discrepancies between actual and expected performance should sometimes be tolerated.
- Leniency depends on the end use of the depiction.
- For online avoidance, focus on error magnitude:
  - Large errors may induce poor management decisions
  - Use sensitivity analysis of system management functions
- For debugging, focus on targeted performance bugs:
  - Noisy depictions will mislead debuggers
  - Group anomalies with the same root cause

Page 13

Anomaly Error Threshold for Debugging

- Observation: anomaly manifestations due to the same cause are more likely to share similar error magnitudes than unrelated anomaly manifestations.
- Root causes can therefore be grouped by clustering on the expectation error.

Page 14

Anomaly Error Threshold for Debugging

- Knee points mark cluster boundaries.
- Knee-point selection:
  - A higher-magnitude knee emphasizes large anomalies
  - A lower-magnitude knee captures multiple anomalies
- Validation: we notice that knee points disappear when problems are resolved.

[Figure: Expectation error clustering. Response time and throughput expectation errors (0% to 100%) over sample runtime conditions sorted on expectation error; the knees in each curve mark cluster boundaries.]
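One simple way to locate knee points on such a sorted error curve is to scan for large drops between consecutive samples. This is a sketch under that assumption; the paper does not spell out its exact knee-detection procedure:

```python
def find_knees(errors, min_jump):
    """Given expectation errors sorted in descending order, return the
    indices where the error drops by more than min_jump between consecutive
    samples -- candidate cluster boundaries (knee points)."""
    return [i for i in range(1, len(errors))
            if errors[i - 1] - errors[i] > min_jump]

# Illustrative sorted expectation errors (fractions, not measured data)
errors = [0.95, 0.93, 0.92, 0.60, 0.58, 0.57, 0.20, 0.18, 0.05]
print(find_knees(errors, min_jump=0.25))  # [3, 6]: boundaries of 3 clusters
```

Raising `min_jump` keeps only the largest knees (emphasizing big anomalies); lowering it splits the curve into more clusters (capturing multiple anomalies), matching the selection trade-off above.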

Page 15

Outline

- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / Conclusion

Page 16

Decision-Tree Based Anomaly Depictions

- Decision trees correlate anomalies to problematic runtime conditions.
- Interpretable, unlike neural nets, SVMs, or perceptrons.
- Require no prior knowledge, unlike Bayesian trees [COH-OSDI04].
- Versatile:
  - White-box usage for debugging hints: prefer shorter, easily interpreted trees
  - Black-box usage for avoidance: prefer longer, more precise trees

Example depiction:
- If a=0: anomaly
- If a=1, b=0: normal
- If a=1, b=1: anomaly

[Figure: A decision tree over runtime conditions a, b, and c. A runtime condition (e.g., a=0, b=1, c=2, ...) is classified by walking from the root to a leaf labeled anomaly or normal with an associated probability (e.g., anomaly with 80% probability).]
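At prediction time, a depiction like this is just nested condition tests. The sketch below hand-codes the slide's three-rule example tree; in practice the tree would be induced from labeled samples of runtime conditions by a standard decision-tree learner:

```python
# The slide's example tree, written out by hand: condition 'a' is tested
# at the root, then condition 'b'.

def classify(condition):
    """Classify a runtime condition (a dict of setting name -> value)."""
    if condition["a"] == 0:
        return "anomaly"
    if condition["b"] == 0:
        return "normal"
    return "anomaly"

print(classify({"a": 0, "b": 1}))  # anomaly
print(classify({"a": 1, "b": 0}))  # normal
print(classify({"a": 1, "b": 1}))  # anomaly
```

For black-box avoidance, a system manager only needs this predict function; for white-box debugging, the path of tests leading to an "anomaly" leaf is itself the hint about which settings correlate with the problem.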

Page 17

Design Recap

We wish to depict performance anomalies across a wide range of system configurations and workload conditions:
1. Derive performance expectations via a hierarchy of sub-models
2. Search for anomalous runtime conditions with a carefully selected anomaly error threshold
3. Use decision trees to extrapolate a comprehensive anomaly depiction

Page 18

Outline

- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / Conclusion

Page 19

Depiction-Assisted Debugging

- System: JBoss
  - 8 runtime conditions (including application type)
  - 4-machine cluster, 2.66 GHz CPUs
- Found and fixed 3 performance anomalies; one is shown in detail below.

[Figure: Depiction of a real performance anomaly. A decision tree splits on application type (container-managed persistence, CMP) and component placement strategy; its leaves are 79%, 68%, 87%, and 88% anomalous, each corresponding to a component-to-node placement such as node 1: {2}, node 2: {1,3,5}, node 3: {4}.]

The anomaly: a misunderstood J2EE configuration that manifests when multiple components are placed on node 2.

Page 20

Discovered Anomalies

1. Misunderstood J2EE configuration caused remote invocations to unintentionally execute locally

2. A mishandled out-of-memory error under high concurrency caused the Tomcat 5.0 servlet container to drop requests

3. Circular dependency in the component invocation sequences caused connection timeouts under certain component placement strategies

Page 21

Outline

- Performance expectations for a wide range of configuration settings
- Determination of the anomaly error threshold
- Decision-tree based anomaly depiction
- Preliminary results
- Discussion / Conclusion

Page 22

Discussion

Limitations:
- Cannot detect non-deterministic anomalies
- Is a discrepancy model inaccuracy or a performance anomaly? Answering requires manual investigation, but the model is much less complex than the system.
- Debugging is still a manual process

Future work:
- Short term: investigate more system configurations
- Short term: depict anomalies in more systems
- Long term: more systematic depiction-assisted debugging methods

Page 23

Take Away

- Comprehensive depictions of performance anomalies over a wide range of runtime conditions can aid debugging and avoidance.
- We have designed and implemented an approach to:
  - Model a wide range of system configurations
  - Determine anomalous conditions
  - Depict the anomalies in an easy-to-interpret fashion
- We have already used our approach to find 3 performance bugs.