
Page 1:

Performance Debugging in Data Centers: Doing More with Less

Prashant Shenoy, UMass Amherst

Joint work with Emmanuel Cecchet, Maitreya Natu, Vaishali Sadaphal and Harrick Vin

Page 2:

Data Centers Today

• Large number of computing, communication, and storage systems

• Wide range of applications and services
• Rapidly increasing scale and complexity
• Limited understanding and control over the operations

Page 3:

Equity Trade Plant

[Figure: a portion of the data center operated by an investment bank for processing trading orders. Nodes represent application processes; edges indicate the flow of requests.]

Page 4:

Equity Trade Plant

• Receives and processes
  – 4-6 million equity orders (trade requests)
  – 10-100 million market updates (news, stock-tick updates, etc.)
• IT infrastructure for processing orders and updates consists of thousands of application components running on hundreds of servers


Page 5:

Performance Debugging in Data Centers

• Low end-to-end latency for processing each request is a critical business requirement

• Increase in latency can be due to
  – Dynamic changes in workload
  – Slowing down of a processing node due to hardware or software errors
• Performance debugging involves detecting and localizing performance faults
• Longer localization time leads to greater business impact

Page 6:

Performance Debugging in Data Centers

• Four key steps
  – Build a model of normal operations of a system
  – Place probes to monitor the operational system
  – Detect performance faults in near-real-time
  – Localize faults by combining the knowledge derived from the model and monitored data

Page 7:

Performance Debugging in Data Centers


Effectiveness of these steps depends on the number and type of data collection probes available in the system.

However, system administrators are reluctant to introduce probes into a production environment, especially if the probes are intrusive (and can modify the system's behavior).

Page 8:

Basic Practical Requirement

• Minimize the amount of instrumentation to gather real-time operational statistics

• Minimize the intrusiveness of the data gathering methods

Page 9:

Basic Practical Requirement


Much of the prior research ignores this requirement and demands:
• Significant instrumentation (e.g., requiring probes to be placed at each process/server)
• Significant intrusiveness (e.g., requiring each request to carry a request-ID to track request flows)

Page 10:

Characterizing State-of-the-art

Page 11:

Basic Practical Requirement


• For automated performance debugging to become practical and effective, one needs to develop techniques that are more effective with less instrumentation and intrusiveness

• We raise several issues and challenges in designing these techniques

Page 12:

Instrumentation Vs. Intrusiveness

• Extent of instrumentation and amount of intrusiveness complement each other
  – E.g., collection of request-component dependencies (see the sketch below)
• High instrumentation, low intrusiveness
  – Each node monitors request arrival events
• Low instrumentation, high intrusiveness
  – Each request stores information about the components it passes through
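As an illustration of the two ends of this tradeoff, the sketch below shows two hypothetical handlers for the same dependency-collection task; the function and field names are assumptions for illustration, not interfaces from the trade plant.

```python
import time

# (a) High instrumentation, low intrusiveness: every node appends arrival events to its
#     own local log; the request itself is left untouched.
def handle_with_local_log(node_id, request, log):
    log.append((time.time(), node_id, request["id"]))

# (b) Low instrumentation, high intrusiveness: only the exit needs instrumentation, but
#     the request must carry the list of components it has passed through.
def handle_with_piggyback(node_id, request):
    request.setdefault("path", []).append(node_id)
```

In (a) the dependency graph is reconstructed offline by correlating per-node logs; in (b) it can be read directly off each request at the exit, at the cost of modifying the request format.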

Page 13:

Instrumentation Vs. Intrusiveness


Observation 3: It is possible to trade off the level of instrumentation against the level of intrusiveness needed for a technique.

Page 14:

Instrumentation Vs. Intrusiveness


Production systems place significant restrictions on which nodes can be instrumented as well as the level of intrusiveness permitted.

Page 15:

Instrumentation Vs. Intrusiveness


Is it possible to achieve effective performance debugging using low instrumentation and low intrusiveness?

Page 16:

Doing More With Less: An Example

Page 17:

A Production Data Center: Characteristics and Constraints

Page 18:

A Production Data Center: Characteristics and Constraints

• 469 nodes
  – Each node represents an application component that processes trading orders and forwards them to a downstream node
• 2,072 links
• 39,567 unique paths
• SLO: end-to-end latency for processing each equity trade should not exceed 7-10 ms

Page 19:

A Production Data Center: Characteristics and Constraints


• Environment imposes severe restrictions on the permitted instrumentation and intrusiveness
  – No instrumentation of intermediate nodes purely for performance debugging
  – SLA compliance is monitored at exit nodes by time-stamping request entry and exit
• Available information
  – Per-hop graph
  – SLO compliance information at the monitors at exit nodes
• No additional information is available

Page 20:

Problem Definition

• Given:
  – System graph depicting application component interactions
  – Instrumentation at the entry and exit nodes that timestamps requests
• Determine:
  – The root cause of SLO violations when one or more exit nodes observe such violations

Page 21:

Straw-man Approaches

• Signature-based localization
• Online signature matching via graph coloring

Page 22:

Signature-Based Localization

• Node signature:
  – Set of all monitors that are reachable from the node
  – K-bit string where each bit represents the reachability of one monitor
• In the presence of a failure, some monitors will observe an SLO violation, thus creating a violation signature
• The fault localization task is to determine the node that could have generated the violation signature

[Figure: the trade plant graph with monitors at the query exit points (SLA validation); each node is labeled with the bit-string signature of the monitors it can reach, e.g. 1110, 0001, 1000, 0100, 0010.]
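To make the signature construction concrete, here is a minimal sketch in Python, assuming the per-hop graph is available as an adjacency dictionary and that the monitors sit at the exit nodes. The function name node_signatures, the toy graph, and the node names are illustrative assumptions, not artifacts from the trade plant.

```python
# Minimal sketch (assumptions noted above): a node's signature is the set of exit-node
# monitors reachable from it, encoded as a K-bit string (one bit per monitor).

def node_signatures(graph, monitors):
    """graph: dict mapping node -> list of downstream nodes.
       monitors: ordered list of K exit nodes that carry SLO monitors.
       Returns a dict mapping each node to its K-bit signature string."""
    index = {m: i for i, m in enumerate(monitors)}

    def reachable_monitors(node, seen):
        # Depth-first search collecting the indices of all monitors reachable from `node`.
        if node in seen:
            return set()
        seen.add(node)
        hits = {index[node]} if node in index else set()
        for nxt in graph.get(node, []):
            hits |= reachable_monitors(nxt, seen)
        return hits

    signatures = {}
    for node in graph:
        bits = reachable_monitors(node, set())
        signatures[node] = "".join("1" if i in bits else "0" for i in range(len(monitors)))
    return signatures


# Toy example: a small plant with two exit monitors M1 and M2.
graph = {"src": ["A"], "A": ["M1", "B"], "B": ["M2"], "M1": [], "M2": []}
print(node_signatures(graph, ["M1", "M2"]))
# {'src': '11', 'A': '11', 'B': '01', 'M1': '10', 'M2': '01'}
```

Note that B and M2 share the signature '01': distinct nodes can collide on the same signature, which is exactly the ambiguity quantified on the next slide.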

Page 23:

Signature-Based Localization

Applying signature-based localization to the equity trade plant system:
• Monitors on 112 exit nodes generated 112-bit signatures
• Generated 137 unique signatures for 357 non-exit nodes (38%)
• Generated 71 unique signatures for 121 source nodes (58%)

Page 24:

Online signature matching

• Graph coloring technique:
  SLA violation → Mark suspect nodes → Clear suspect nodes that lead to a valid request execution → Root cause of SLA violation
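One way to read the coloring pipeline above is as two set operations over the node signatures. The sketch below is an interpretation under that assumption (the slides do not spell out the exact algorithm); the signature dictionary reuses the toy example from the earlier sketch.

```python
# Interpretation sketch: (1) mark as suspect every node whose signature covers a violating
# monitor; (2) clear any suspect that also reaches a monitor reporting valid executions.

def localize(signatures, monitors, violating, compliant):
    bad = {monitors.index(m) for m in violating}
    good = {monitors.index(m) for m in compliant}
    suspects = {n for n, s in signatures.items() if any(s[i] == "1" for i in bad)}
    return {n for n in suspects if not any(signatures[n][i] == "1" for i in good)}


# Toy signatures from the earlier sketch; M2 reports SLA violations, M1 stays compliant.
sigs = {"src": "11", "A": "11", "B": "01", "M1": "10", "M2": "01"}
print(localize(sigs, ["M1", "M2"], violating=["M2"], compliant=["M1"]))  # {'B', 'M2'}
```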

Page 25:

Opportunities and Challenges

Page 26:

Deriving a System Model

• Objective:
  – Real production systems are too large and complex to manually derive a system model
  – Need for automatic generation and maintenance of the model
• Challenges:
  – Need for reasonably low instrumentation and intrusiveness
  – Several low-cost mechanisms can be considered here (see the sketch below):
    • Network packet sniffing to derive the component communication pattern
    • Examining application logs to derive the component communication pattern and request flows
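A minimal sketch of the first low-cost mechanism, assuming that (source, destination) component pairs have already been extracted from packet captures or application logs; the record format and component names are illustrative assumptions.

```python
from collections import defaultdict

def build_per_hop_graph(connection_records):
    """connection_records: iterable of (src, dst) component pairs observed on the wire
       or in logs. Returns an adjacency dict: component -> set of downstream components."""
    graph = defaultdict(set)
    for src, dst in connection_records:
        graph[src].add(dst)
        graph.setdefault(dst, set())  # make sure pure sinks also appear as nodes
    return dict(graph)


records = [("gateway", "risk-check"), ("risk-check", "matcher"), ("gateway", "matcher")]
print(build_per_hop_graph(records))
```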

Page 27:

Monitor Placement

• Objective:
  – Place monitors at suitable locations to measure end-to-end performance metrics
• Challenges:
  – Deployment of monitors involves instrumentation overhead
    • Need to minimize the number of monitors
  – Tradeoff between the number of monitors and the accuracy of fault detection and localization
    • A smaller number of monitors increases the chance of signature collisions

Page 28:

Monitor Placement

– Structure of the graph affects the distribution of signatures across nodes
  • In the ideal case, n unique signatures can be generated using log(n) monitors (see the arithmetic check below)

[Figure: two example graph structures with brackets marking groups of nodes that share the same signature.]
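To make the log(n) figure concrete, a quick arithmetic check (the 469-node count is the plant size given earlier): K monitors can produce at most 2^K distinct K-bit signatures, so giving each of n nodes a unique signature requires at least ceil(log2 n) monitors, and in the ideal case that many suffice.

```python
import math

n = 469                      # application components in the equity trade plant
k = math.ceil(math.log2(n))  # minimum number of monitors in the ideal case
print(k, 2 ** k >= n)        # 9 True -- nine ideally placed monitors could distinguish all 469 nodes
```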

Page 29:

Real-Time Failure Detection

• Objective:
  – Quick and accurate detection of the presence of failures based on observations at the monitor nodes
• Challenges:
  – Differentiate between effects due to workload changes and those due to failures
  – Deal with scenarios where a node failure affects only a few of the requests passing through the node
  – Transient failures
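A hedged sketch of one simple detector an exit monitor could run (not the talk's algorithm): keep a sliding window of end-to-end latencies computed from entry/exit timestamps and raise an alarm only when the fraction of SLO-violating requests in the window exceeds a threshold, which helps ignore isolated transients. The 10 ms SLO echoes the 7-10 ms requirement stated earlier; the window size and violation fraction are assumptions.

```python
from collections import deque

class ExitMonitor:
    def __init__(self, slo_ms=10.0, window=1000, violation_fraction=0.05):
        self.slo_ms = slo_ms
        self.window = deque(maxlen=window)          # sliding window of True/False violations
        self.violation_fraction = violation_fraction

    def observe(self, entry_ts_ms, exit_ts_ms):
        """Record one request; return True if an alarm should be raised."""
        self.window.append(exit_ts_ms - entry_ts_ms > self.slo_ms)
        if len(self.window) < self.window.maxlen:   # wait until the window is full
            return False
        return sum(self.window) / len(self.window) > self.violation_fraction
```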

Page 30:

Fault Localization

• Objective:
  – Identify the root cause of the problem after detecting a failure at one or more monitor nodes (SLO violation signature)
• Challenges:
  – Presence of multiple failures leads to a composite signature
  – Edges from the failed node to the monitors are traversed in a non-uniform manner, leading to a partial signature
  – Transient failures
  – Inherent non-determinism in real systems (e.g., presence of load balancers)
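To illustrate the composite-signature challenge: under the (assumed) approximation that concurrent failures combine as a bitwise OR of the individual signatures, a localizer that only matches single-node signatures may find no exact match at all.

```python
def composite_signature(sig_a, sig_b):
    # Bitwise OR of two equal-length bit-string signatures.
    return "".join("1" if a == "1" or b == "1" else "0" for a, b in zip(sig_a, sig_b))

print(composite_signature("1100", "0010"))  # '1110' -- matches neither failed node alone
```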

Page 31:

Conclusions

• Detecting and localizing performance faults in data centers has become a pressing need and a challenge

• Performance debugging can become practical and effective only if it requires low levels of instrumentation and intrusiveness

• We proposed straw-man approaches for performance debugging and presented issues and challenges for building practical and effective solutions

Page 32:

Instrumentation and Intrusiveness

Page 33:

Instrumentation for Failure Detection

– End-to-end latency: difference between the timestamps of the arrival and departure of a request
  • High instrumentation intrusiveness
– Throughput: number of requests departing the system within a defined interval
  • Low instrumentation intrusiveness

Observation 1: The instrumentation intrusiveness is a direct function of the performance metric of interest
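A brief sketch contrasting the two metrics (illustrative assumptions, not interfaces from the talk): throughput needs only a departure counter at the exit, whereas latency requires correlating the entry and exit timestamps of the same request, hence the difference in intrusiveness.

```python
import time
from collections import defaultdict

class ThroughputMonitor:
    """Low intrusiveness: counts departures per fixed interval; no per-request state."""
    def __init__(self, interval_s=1.0):
        self.interval_s = interval_s
        self.counts = defaultdict(int)

    def on_departure(self):
        self.counts[int(time.time() // self.interval_s)] += 1


class LatencyMonitor:
    """Higher intrusiveness: must match the arrival and departure of the same request."""
    def __init__(self):
        self.arrivals = {}
        self.latencies_ms = []

    def on_arrival(self, request_id):
        self.arrivals[request_id] = time.time()

    def on_departure(self, request_id):
        start = self.arrivals.pop(request_id, None)
        if start is not None:
            self.latencies_ms.append((time.time() - start) * 1000.0)
```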

Page 34:

Instrumentation for Fault Localization

• Simple solution: Measure performance metrics and resource utilization at all servers– High instrumentation– High overhead (monitoring and data management)

• Sophisticated solutions: Collect operational semantics of the system (e.g., request component dependencies)– Low instrumentation (not each node needs to be

instrumented)– High intrusiveness (modifications at system,

middleware, application level)

Page 35:

Instrumentation for Fault Localization

• Collection of different system information require different level of intrusiveness• Per-hop graph indicating component interactions: simple

network sniffing • Derivation of flow of requests: application aware

monitoring (e.g. by insertion of transaction-id in the requests)

Page 36:

Characterizing State-of-the-art

Observation 2: Most techniques require high instrumentation, high intrusiveness, or both