TRANSCRIPT
ROC Retreat 1/14/2003
Connecting the Dots: Using Runtime Paths for Macro Analysis
Mike Chen
[email protected]
http://pinpoint.stanford.edu
Motivation
– Divide and conquer, layering, and replication are fundamental design principles, e.g. in Internet systems, P2P systems, and sensor networks
– Execution context is dispersed throughout the system => difficult to monitor and debug
– Many existing low-level tools help with debugging individual components, but not a collection of them; much of the system is in how the components are put together
– Observation: a widening gap between the systems we are building and the tools we have
[Slide diagrams: a P2P overlay of peers, a sensor network, and a three-tier web system (Apache → JavaBeans → Database)]
Current Approach
– Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component
– Scenario: a user reports that request A1 failed. You try the same request, A2, but it works fine. What to do next?
[Diagram: requests A1 and A2 flowing through Apache, JavaBeans, and the database, with gdb exposing per-component variable state (X, Y)]
Macro Analysis
– Macro analysis exploits non-local context to improve reliability and performance; performance examples: Scout, ILP, Magpie
– A statistical view is essential for large, complex systems
– Analogy: micro analysis lets you understand the details of an individual honeybee; macro analysis is needed to understand how the bees interact to keep the beehive functioning
Observation
– Systems have a single system-wide execution path associated with each request, e.g. request/response, one-way messages
– Scout, SEDA, and Ninja use paths to specify how to service requests
– Our philosophy: use only dynamic, observed behavior; application-independent techniques
Our Approach
– Use runtime paths to connect the dots!
– Dynamically capture the interactions and dependencies between components
– Look across many requests to get the overall system behavior (more robust to noise)
– Components are only partially known (“gray boxes”)
[Diagram: a runtime path traced through Apache → JavaBeans → Database]
Our Approach: applicable to a wide range of systems.

[Diagrams: paths through a tiered web system (Apache/JavaBeans/Database), a sensor network, a P2P overlay, and a generic component graph]
Open Challenges in Systems Today
1. Deducing system structure: the manual approach is error-prone; static analysis doesn’t consider resources
2. Detecting application-level failures: they often don’t exhibit lower-level symptoms
3. Diagnosing failures: failures may manifest far from the actual faults; multi-component faults
Goal: reduce time to detection, recovery, diagnosis, and repair
Talk Outline
– Motivation
– Model and architecture
– Applying macro analysis
– Future directions
Runtime Paths
– Instrument code to dynamically trace requests through a system at the component level: record the call path plus runtime properties, e.g. components, latency, success/failure, and resources used to service each request
– Use statistical analysis to detect and diagnose problems, e.g. data mining, machine learning, etc.
– Runtime analysis tells you how the system is actually being used, not how it may be used
– Complements existing micro analysis tools
Reusable Analysis Framework
Architecture:
– Tracer: tags each request with a unique ID, carries it with the request throughout the system, and reports observations (component name + resource + performance properties) for each component
– Aggregator + Repository: reconstructs paths and stores them
– Declarative Query Engine: supports statistical queries on paths; data mining and machine learning routines
– Visualization
[Architecture diagram: components A–F, each instrumented with a Tracer, report to an Aggregator feeding a Path Repository; a Query Engine and Visualization sit on top for developers/operators; a request enters the system at A]
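As a rough sketch of the Aggregator’s job (the record layout and function names here are illustrative assumptions, not the actual Pinpoint code), it can group observation records by request ID and order them by sequence number to reconstruct each path:

```python
from collections import defaultdict

def reconstruct_paths(observations):
    """Group (request_id, seq_num, component) observations into ordered paths."""
    by_request = defaultdict(list)
    for request_id, seq_num, component in observations:
        by_request[request_id].append((seq_num, component))
    # Order each request's observations by sequence number to recover the path.
    return {rid: [comp for _, comp in sorted(obs)]
            for rid, obs in by_request.items()}

# Observations may arrive interleaved and out of order from different tracers.
obs = [(1, 2, "JavaBean"), (2, 1, "Apache"), (1, 1, "Apache"),
       (1, 3, "Database"), (2, 2, "JavaBean")]
paths = reconstruct_paths(obs)
print(paths[1])  # ['Apache', 'JavaBean', 'Database']
```

Reconstructed paths can then be stored in the repository and queried statistically.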
Request Tracing
– Challenge: maintaining an ID with each request throughout the system
– Tracing is platform-specific but can be application-generic and reusable across applications
– 2 classes of techniques:
– Intra-thread tracing: use per-thread context to store the request ID (e.g. ThreadLocal in Java); the ID is preserved if the same thread is used to service the request
– Inter-thread tracing: for extensible protocols like HTTP, inject new headers that will be preserved (e.g. REQ_ID: xx); modify RPC to pass the request ID under the covers; piggyback onto messages
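A minimal sketch of intra-thread tracing, using Python’s `threading.local` as the analogue of Java’s ThreadLocal (the function names are illustrative; the real tracers are platform-specific):

```python
import threading

# Per-thread context, analogous to Java's ThreadLocal: as long as the same
# thread services the request, the ID is implicitly available everywhere.
_context = threading.local()

def start_request(request_id):
    _context.request_id = request_id

def current_request_id():
    return getattr(_context, "request_id", None)

def handle(component):
    # Any component can report an observation tagged with the current request ID
    # without the ID being passed through every call signature.
    return (current_request_id(), component)

start_request(42)
print(handle("JavaBean"))  # (42, 'JavaBean')
```

If the request hops threads or machines, this breaks down, which is exactly where the inter-thread techniques (header injection, RPC piggybacking) take over.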
Talk Outline
– Motivation
– Model and architecture
– Applying macro analysis: inferring system structure, detecting application-level failures, diagnosing failures
– Future directions
Inferring System Structure
– Key idea: paths directly capture application structure

[Diagram: two requests traced through the component graph]
Indirect Coupling of Requests
– Key idea: paths associate requests with internal state
– Trace requests from the web server to the database; parse client-side SQL queries to get sharing of database tables
– Straightforward to extend to more fine-grained state (e.g. rows)

[Diagram: bipartite graph linking request types to the database tables they touch]
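A minimal sketch of the table-sharing step (the regex and the traced-path records are illustrative assumptions; robust SQL parsing needs a real parser):

```python
import re
from collections import defaultdict

# Naive extraction of table names from simple SELECT/INSERT/UPDATE statements.
_TABLE_RE = re.compile(r"\b(?:from|into|update|join)\s+(\w+)")

def tables_used(sql):
    return set(_TABLE_RE.findall(sql.lower()))

def table_sharing(traced_requests):
    """Map each request type to the DB tables its traced SQL statements touch."""
    sharing = defaultdict(set)
    for request_type, sql_statements in traced_requests:
        for sql in sql_statements:
            sharing[request_type] |= tables_used(sql)
    return sharing

traced = [("checkout", ["SELECT * FROM orders WHERE id=1",
                        "UPDATE inventory SET qty=qty-1"]),
          ("browse",   ["SELECT name FROM inventory"])]
print(table_sharing(traced))
# 'checkout' and 'browse' are indirectly coupled through the inventory table.
```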
Failure Detection and Diagnosis
– Detecting application-level failures. Key idea: paths change under failures => detect failures via path changes.
– Diagnosing failures. Key idea: bad paths touch the root cause(s); find common features.
Future Directions
– Key idea: violations of macro invariants are signs of a buggy implementation or an intrusion
– Message paths in P2P and sensor networks: a general mechanism to provide visibility into the collective behavior of multiple nodes
– Micro or static approaches by themselves don’t work well in dynamic, distributed settings
– E.g. algorithms have upper bounds on the # of hops: although a hop-count violation can be detected locally, paths help identify nodes that route messages incorrectly
– E.g. detecting nodes that are slow or that corrupt messages
Conclusion
– Macro analysis fills the need when monitoring and debugging systems where local context is of insufficient use
– The runtime path-based approach dynamically traces request paths and statistically infers macro properties
– A shared analysis framework that is reusable across many systems: simplifies the construction of effective tools for other systems and the integration with recovery techniques like RR
– http://pinpoint.stanford.edu; the paper includes a commercial example from Tellme! (thanks to Anthony Accardi and Mark Verber)
Backup Slides
Current Approach
– Micro analysis tools, like code-level debuggers (e.g. gdb) and application logs, offer details of each individual component

[Diagram: Apache / JavaBeans / Database, with gdb showing different X, Y values inside each JavaBean instance]
Related Work
– Commercial request tracing systems: announced in 2002, a few months after Pinpoint was developed. PerformaSure and AppAssure focus on performance problems; IntegriTea captures and plays back failure conditions. They focus on individual requests rather than overall behavior, and on recreating the failure condition.
– Extensive work in event/alarm correlation, mostly in the context of network management (i.e. IP): these don’t directly capture relationships between events, and rely on human knowledge or machine learning to suppress alarms.
– Distributed debuggers: PDT, P2D2, TotalView, PRISM, pdbx. They aggregate views from multiple components but do not capture relationships and interactions between components. Comparative debuggers: Wizard, GUARD.
– Dependency models: most are statically generated and are likely to be inconsistent. Brown et al. take an active, black-box approach, but it is invasive. Candea et al. dynamically trace failure propagation.
1. Detecting Failures using Anomaly Detection
– Key idea: paths change under failures => detect failures via path changes
– Anomalies: unusual paths; changes in distribution; changes in latency/response time
– Examples: error paths are shorter; user behavior changes under failures (retries a few times, then gives up)
– Implement as long-running queries (i.e. diff)
– Challenges: detecting application-level failures; comparing sets of paths
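One way to sketch “detect failures via path changes” (the metric choice, total variation distance, and the 0.3 threshold are illustrative assumptions, not the deck’s method) is to compare the distribution of path shapes in a baseline window against a current window:

```python
from collections import Counter

def path_distribution(paths):
    """Frequency distribution over path shapes (tuples of component names)."""
    counts = Counter(tuple(p) for p in paths)
    total = sum(counts.values())
    return {shape: n / total for shape, n in counts.items()}

def distribution_shift(baseline, current):
    """Total variation distance between two path distributions (0 = identical)."""
    shapes = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(s, 0) - current.get(s, 0)) for s in shapes)

normal = [["Apache", "JavaBean", "Database"]] * 9 + [["Apache", "JavaBean"]]
failing = [["Apache"]] * 5 + [["Apache", "JavaBean", "Database"]] * 5  # error paths are shorter
shift = distribution_shift(path_distribution(normal), path_distribution(failing))
print(shift > 0.3)  # a large shift flags an anomaly
```

Running this continuously over sliding windows is one concrete form of the “long-running diff query” above.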
2. Root-cause Analysis
– Key idea: all bad paths touch the root cause; find common features
– Challenge: a small set of known bad paths and a large set of maybes
– Ideally want to correlate and rank all combinations of feature sets, e.g. association rule mining; may get false alarms because the root cause may not be one of the features
– Automatic generation of dynamic functional and state dependency graphs: helps developers and operators understand inter-component and inter-request dependencies; input to recovery algorithms that use dependency graphs
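A toy version of the correlation-and-ranking idea (the scoring rule is an illustrative assumption; Pinpoint itself uses statistical clustering): score each component by how much more often it appears in failed paths than in successful ones.

```python
def rank_root_causes(paths):
    """Score components by how strongly they correlate with failed requests.

    paths: list of (components, failed) pairs. Score = P(in path | failed)
    minus P(in path | success), so components unique to bad paths rank first.
    """
    failed = [set(c) for c, bad in paths if bad]
    ok = [set(c) for c, bad in paths if not bad]
    components = set().union(*failed, *ok)
    def rate(group, comp):
        return sum(comp in p for p in group) / len(group) if group else 0.0
    scores = {c: rate(failed, c) - rate(ok, c) for c in components}
    return sorted(scores, key=scores.get, reverse=True)

traced = [(["Apache", "CartEJB", "Database"], True),
          (["Apache", "CartEJB", "Database"], True),
          (["Apache", "SignOnEJB", "Database"], False),
          (["Apache", "Database"], False)]
print(rank_root_causes(traced)[0])  # CartEJB appears only in failed paths
```

Note the false-alarm caveat above still applies: if the real cause (say, a bad input) is not one of the recorded features, the top-ranked component is only a correlate.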
3. Verifying Macro Invariants
– Key idea: violations of high-level invariants are signs of intrusion or bugs
– Example: peer auditing. Problem: a small number of faulty or malicious nodes can bring down the system. Corruption should be statistically visible in observed behavior: look for nodes that delay or corrupt messages or route messages incorrectly, and apply root-cause analysis to locate the misbehaving peers (some distributed auditing is necessary).
– Example: P2P implementation verification. Problem: are messages delivered as specified by the algorithms? Detect extra hops and loops, and verify that the paths are correct.
– Can implement as a query: select length from paths where (length > log2(N))
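The hop-count query above can be sketched directly (a minimal sketch, assuming each path is the ordered list of node IDs a message visited; the log2(N) bound is the routing guarantee of structured overlays like Chord or Pastry):

```python
import math

def hop_count_violations(paths, n_nodes):
    """Return paths longer than the log2(N) hop bound of a structured overlay,
    i.e. the query: select length from paths where (length > log2(N))."""
    bound = math.log2(n_nodes)
    return [p for p in paths if len(p) > bound]

msgs = [["n1", "n4", "n7"],                        # 3 hops: fine for N = 1024
        ["n1", "n2", "n3", "n2", "n3", "n2", "n3",
         "n2", "n3", "n2", "n3"]]                  # 11 hops: a routing loop
print(len(hop_count_violations(msgs, 1024)))  # 1
```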
4. Detecting Single Points of Failure
– Key idea: paths converge on a single point of failure; useful for finding out what to replicate to improve availability
– P2P example: many P2P systems rely on overlay networks built on top of the IP infrastructure. It’s common for several overlay links to fail together if they depend on a shared physical IP link that failed.
– Implement as a query: intersect edge.IP_links from paths

[Diagrams: failed paths in a P2P overlay, a sensor network, and a component graph converging on a shared link]
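A sketch of the intersect query (the data model, each overlay edge carrying the set of physical IP links it traverses, is an illustrative assumption):

```python
def shared_ip_links(failed_paths):
    """'intersect edge.IP_links from paths': IP links common to all failed paths.

    failed_paths: list of paths; each path is a list of overlay edges, and each
    edge is the set of physical IP links it rides on.
    """
    link_sets = [set().union(*path) for path in failed_paths]
    return set.intersection(*link_sets)

# Three failed overlay paths; ip2 underlies an edge in every one of them.
failed = [[{"ip1", "ip2"}, {"ip5"}],
          [{"ip2", "ip3"}],
          [{"ip2"}, {"ip7"}]]
print(shared_ip_links(failed))  # {'ip2'}: the shared link to replicate around
```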
5. Monitoring of Sensor Networks
– An emerging area with primitive tools
– Key idea: use paths to reconstruct topology and membership
– Example: membership (select unique node from paths); network topology (for directed information dissemination)
– Challenge: limited bandwidth. Can record a (random) subset of the nodes for each path, then statistically reconstruct the paths.
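Both reconstructions are simple once paths exist; a minimal sketch (node names are invented for illustration):

```python
def membership(paths):
    """'select unique node from paths': every node seen on any path."""
    return set().union(*map(set, paths))

def topology(paths):
    """Directed edges observed between consecutive nodes on each path."""
    edges = set()
    for path in paths:
        edges.update(zip(path, path[1:]))
    return edges

observed = [["s1", "s2", "sink"], ["s3", "s2", "sink"]]
print(membership(observed))                  # {'s1', 's2', 's3', 'sink'}
print(("s2", "sink") in topology(observed))  # True
```

With sampled paths (the bandwidth workaround above), the same queries run over the statistically reconstructed paths instead.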
Macro Analysis
– Look across many requests to get the overall system behavior (more robust to noise)

[Table: requests 1–4 vs. components A–C, marking which components each request used; component A appears in three of the four requests, B in one, C in two]
Properties of Network Systems
– Web services, P2P systems, and sensor networks can have tens of thousands of nodes, each running many application components
– Continuous adaptation provides high availability, but also makes it difficult to reproduce and debug errors
– Constant evolution of software and hardware
Motivation
– Difficult to understand and debug network systems, e.g. clustered Internet systems, P2P systems, and sensor networks: composed of many components; systems are becoming larger, more dynamic, and more distributed
– Workload is unpredictable and impractical to simulate. Unit testing is necessary but insufficient: components break when used together under real workload.
– We don’t have tools that capture the interactions between components and the overall behavior; existing debugging tools and application-level logs only do micro analysis
Macro vs Micro Analysis

              Macro Analysis                            Micro Analysis
Resolution    Component; complements micro tools        Line or variable
Overhead      Low; can use it in actual deployment      High; typically not used in deployment other than application logs
What’s a dynamic path?
– A dynamic path is the (control flow + runtime properties) of a request. Think of it as a stack trace across process/machine boundaries with runtime properties, dynamically constructed by tracing requests through a system.
– Runtime properties: resources (e.g. host, version); performance properties (e.g. latency); arguments (e.g. URL, args, SQL statement); success/failure

[Diagram: a request enters component A; each observation records RequestID: 1, Seq Num: 1, Name: A, Host: xx, Latency: 10ms, Success: true, …; the path continues through components B–F]
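The observation record on the slide can be sketched as a data type (field names follow the slide; the aggregate helper is an illustrative addition):

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """One tracer report, mirroring the slide's fields: request ID, sequence
    number, component name, host, latency, success/failure."""
    request_id: int
    seq_num: int
    name: str
    host: str
    latency_ms: float
    success: bool

def path_summary(observations):
    """Total latency of a path, and whether every component succeeded."""
    return (sum(o.latency_ms for o in observations),
            all(o.success for o in observations))

path = [Observation(1, 1, "A", "xx", 10.0, True),
        Observation(1, 2, "D", "yy", 5.0, True)]
print(path_summary(path))  # (15.0, True)
```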
Related Work
– Micro debugging tools: RootCause provides extensible logging of method calls and arguments; DIDUCE looks for inconsistencies in variable usage. These complement macro analysis tools.
– Languages for monitoring: InfoSpect looks for inconsistencies in system state using a logic language
– Network flow-based monitoring: RTFM and Cisco NetFlow classify and record network flows
– Statistical and data mining languages: S, DMQL, WebML
Visualization Techniques
– Tainted paths: mark all flows that have a certain property (e.g. failed or slow) with a distinct color and overlay them on the graph
– Detecting performance bottlenecks: look for replicated nodes that have different colors
– Detecting anomalies: look for missing edges and unknown paths
Pinpoint Framework

[Diagram: requests #1–#3 enter components A, B, C over a communications layer (tracing & internal F/D); the tracing layer and an external F/D emit logs (request-to-component records such as “1,A 1,C 2,B …” and verdicts such as “1, success 2, fail …”) that feed statistical analysis, which outputs detected faults]
Experimental Setup
– Demo app: J2EE Pet Store, an e-commerce site with ~30 components
– Load generator: replays a trace of browsing; approximates the TPC-W WIPSo load (~50% ordering)
– Fault injection parameters: trigger faults based on combinations of used components; inject exceptions, infinite loops, null calls
– 55 tests with single-component faults and interaction faults; 5-min runs of a single client (J2EE server limitation)
Application Observations

[Charts: cumulative % of total requests vs. # of components used per request; and # of tightly coupled components for cart, productItemList, InventoryEJB, SignOnEJB, ClientControllerEJB, ShoppingCartEJB]

– # of components used in a dynamic web page request: median 14, min 6, max 23
– A large number of tightly coupled components are always used together
Metrics
– Precision: C/P
– Recall: C/A
– Accuracy: whether all actual faults are correctly identified (recall == 100%); a boolean measure

[Venn diagram: Predicted Faults (P) and Actual Faults (A), overlapping in Correctly Identified Faults (C)]
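The three metrics follow directly from the set definitions above; a minimal sketch (the fault names are invented for illustration):

```python
def metrics(predicted, actual):
    """Precision = C/P, recall = C/A, accuracy = (recall == 100%),
    where C is the overlap of predicted (P) and actual (A) fault sets."""
    correct = predicted & actual
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual) if actual else 0.0
    return precision, recall, recall == 1.0

p, r, acc = metrics(predicted={"CartEJB", "Apache"}, actual={"CartEJB"})
print(p, r, acc)  # 0.5 1.0 True — one false positive, but all faults found
```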
4 Analysis Techniques
– Pinpoint: clusters of components that statistically correlate with failures
– Detection: components where Java exceptions were detected (union across all failed requests; similar to what an event monitoring system outputs)
– Intersection: intersection of components used in failed requests
– Union: union of all components used in failed requests
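The two set-based baselines are trivial to state in code (a sketch over invented component sets; Pinpoint’s own clustering is more involved):

```python
def intersection_technique(failed_paths):
    """Baseline: components common to every failed request."""
    return set.intersection(*map(set, failed_paths))

def union_technique(failed_paths):
    """Baseline: all components used by any failed request."""
    return set.union(*map(set, failed_paths))

failed = [{"Apache", "CartEJB", "Database"},
          {"Apache", "CartEJB"},
          {"Apache", "CartEJB", "SignOnEJB"}]
print(intersection_technique(failed))  # {'Apache', 'CartEJB'}
print(len(union_technique(failed)))    # 4
```

This also shows why the baselines trade off precision and recall: intersection stays small but misses multi-component faults, while union catches everything at the cost of many false positives.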
Results
– Pinpoint has high accuracy with relatively high precision

[Bar chart: average accuracy and average precision for Pinpoint, Detection, Intersection, and Union (bar labels include 76%, 50%, 83%, 100%, 33%, 77%, 15%, 11%); line chart: average accuracy vs. # of interacting components (1–4) for each of the four techniques]
Pinpoint Prototype Limitations
– Assumptions: client requests provide good coverage over components and combinations; requests are autonomous (don’t corrupt state and cause later requests to fail)
– Currently can’t detect: faults that only degrade performance; faults due to pathological inputs
– Single-node only
Current Status
– Simple graph visualization
Proposed Research
3 classes of large network systems:
– Clustered Internet systems: tiered architecture, high bandwidth, many replicas
– Peer-to-peer (P2P) systems, including sensor networks: widely distributed nodes, dynamic membership
– Sensor networks: limited storage, processing, and bandwidth
P2P Systems: Tracing
– Trace messages by piggybacking the current node name onto the messages
– Tracing overhead: assume 32 bits per node name and a very conservative log2(N) hops for each message
– Data overhead is 40% for a 1500-byte message in a 10^6-node system
P2P Systems: Implementation Verification
– Current debugging technique: lots of printf()’s on each node, then manually correlating the paths taken by messages
– How do you know the messages are delivered as specified by the algorithms?
– Use message paths to check routing invariants: detect extra hops and loops, and verify that the paths are correct
– Can implement as a query: select length from paths where (length > log2(N))