![Page 1: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/1.jpg)
Pip: Detecting the Unexpected in Distributed Systems

Janet Wiener
Jeff Mogul
Mehul Shah
Chip Killian
Amin Vahdat
Patrick Reynolds

http://issg.cs.duke.edu/pip/
[email protected]
![Page 2: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/2.jpg)
page 2 Pip - November 2005
Motivation
• Distributed systems exhibit complex behaviors
• Some behaviors are unexpected
  – Structural bugs
    • Placement or timing of processing and communication
  – Performance problems
    • Throughput bottlenecks
    • Over- or under-consumption of resources
    • Unexpected interdependencies
• Parallel, inter-node behavior is hard to capture with serial, single-node tools
  – Not captured by traditional debuggers, profilers
  – Not captured by unstructured log files
![Page 3: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/3.jpg)
Motivation
Three target audiences:
• Primary programmer
  – Debugging or optimizing his/her own system
• Secondary programmer
  – Inheriting a project or joining a programming team
  – Learning how the system behaves
• Operator
  – Monitoring running system for unexpected behavior
  – Performing regression tests after a change
![Page 4: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/4.jpg)
Motivation
• Programmers wish to examine and check system-wide behaviors
  – Causal paths
  – Components of end-to-end delay
  – Attribution of resource consumption
• Unexpected behavior might indicate a bug
[Figure: Web server, App server, Database; labels: "500ms", "2000 page faults"]
![Page 5: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/5.jpg)
Pip overview
Pip:
1. Captures events from a running system
2. Reconstructs behavior from events
3. Checks behavior against expectations
4. Displays unexpected behavior
  • Both structure and resource violations
Goal: help programmers locate and explain bugs
[Figure: Application, Behavior model, Expectations, Pip checker, Unexpected structure, Resource violations, Pip explorer: visualization GUI]
![Page 6: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/6.jpg)
Outline
• Expressing expected behavior
• Building a model of actual behavior
• Exploring application behavior
• Results
  – FAB
  – RanSub
  – SplitStream
![Page 7: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/7.jpg)
Describing application behavior
• Application behavior consists of paths
  – All events, on any node, related to one high-level operation
  – Definition of a path is programmer defined
  – Path is often causal, related to a user request
[Figure: timeline across WWW, App server, and DB; task labels: "Parse HTTP", "Query", "Send response", "Run application"; axis: time]
![Page 8: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/8.jpg)
Describing application behavior
• Within paths are tasks, messages, and notices
  – Tasks: processing with start and end points
  – Messages: send and receive events for any communication
    • Includes network, synchronization (lock/unlock), and timers
  – Notices: time-stamped strings; essentially log entries
[Figure: timeline across WWW, App server, and DB; task labels: "Parse HTTP", "Query", "Send response", "Run application"; axis: time; notices: “Request = /cgi/…”, “2096 bytes in response”, “done with request 12”]
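The vocabulary on this slide (paths made of tasks, messages, and notices) can be pictured as a small data model. The sketch below is only an illustration in Python, not Pip's actual representation: the class and field names, the timestamps, and the assignment of tasks to hosts are all assumptions; only the task names and notice strings come from the figure.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class Task:
    """Processing with a start and an end point on one host."""
    name: str
    host: str
    start: float
    end: float


@dataclass
class Message:
    """A paired send and receive event."""
    sender: str
    receiver: str
    send_ts: float
    recv_ts: float


@dataclass
class Notice:
    """A time-stamped string, essentially a log entry."""
    host: str
    ts: float
    text: str


Event = Union[Task, Message, Notice]


@dataclass
class Path:
    """All events, on any node, related to one high-level operation."""
    path_id: str
    events: List[Event] = field(default_factory=list)

    def hosts(self) -> Set[str]:
        """Return every host that took part in this path."""
        seen: Set[str] = set()
        for ev in self.events:
            if isinstance(ev, Message):
                seen.update((ev.sender, ev.receiver))
            else:
                seen.add(ev.host)
        return seen


def example_path() -> Path:
    # Timestamps and host placement are invented for illustration.
    return Path("request 12", [
        Task("Parse HTTP", "WWW", 0.000, 0.004),
        Notice("WWW", 0.001, "Request = /cgi/…"),
        Message("WWW", "App server", 0.004, 0.006),
        Task("Run application", "App server", 0.006, 0.050),
        Message("App server", "DB", 0.010, 0.012),
        Task("Query", "DB", 0.012, 0.040),
        Task("Send response", "WWW", 0.052, 0.060),
        Notice("WWW", 0.060, "done with request 12"),
    ])
```

Calling `example_path().hosts()` collects the three hosts from the figure, which is the kind of per-path property that later slides constrain with expectations.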
![Page 9: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/9.jpg)
Expectations: Recognizers
• Application behavior consists of paths
• Each recognizer matches paths
  – A path can match more than one recognizer
• A recognizer can be a validator, an invalidator, or neither
• Any path matching zero validators or at least one invalidator is unexpected behavior: bug?

    validator CGIRequest
      task("Parse HTTP") limit(CPU_TIME, 100ms);
      notice(m/Request URL: .*/);
      send(AppServer);
      recv(AppServer);

    invalidator DatabaseError
      notice(m/Database error: .*/);
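The classification rule on this slide (a path is unexpected if it matches zero validators or at least one invalidator) can be sketched in a few lines of Python. This is a deliberately reduced model, not Pip's matcher: a path is represented only by its notice strings, each recognizer is a single regular expression applied with `search`, and the names `NoticeRecognizer` and `is_unexpected` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class NoticeRecognizer:
    name: str
    kind: str            # "validator", "invalidator", or "neither"
    pattern: re.Pattern  # assumption: one regex over notices stands in for a full recognizer

    def matches(self, notices: List[str]) -> bool:
        return any(self.pattern.search(n) for n in notices)


def is_unexpected(notices: List[str], recognizers: List[NoticeRecognizer]) -> bool:
    """Unexpected = matches zero validators or at least one invalidator."""
    matched = [r for r in recognizers if r.matches(notices)]
    validators = sum(1 for r in matched if r.kind == "validator")
    invalidators = sum(1 for r in matched if r.kind == "invalidator")
    return validators == 0 or invalidators >= 1


RECOGNIZERS = [
    NoticeRecognizer("CGIRequest", "validator", re.compile(r"Request URL: .*")),
    NoticeRecognizer("DatabaseError", "invalidator", re.compile(r"Database error: .*")),
]
```

A path whose notices include a request URL and no database error is expected; adding a `Database error: ...` notice flips it to unexpected even though the validator still matches.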
![Page 10: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/10.jpg)
Expectations: Recognizers language
• repeat: matches a ≤ n ≤ b copies of a block

    repeat between 1 and 3 { … }

• xor: matches any one of several blocks

    xor {
      branch: …
      branch: …
    }

• call: include another recognizer (macro)
• future: block matches now or later
  – done: force named block to match

    future F1 { … }
    …
    done(F1);
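One way to get a feel for `repeat` and `xor` is to model them as matcher combinators over a flat sequence of event names. The Python sketch below is an assumption-laden toy, not Pip's implementation: events are plain strings, `lit`, `seq`, `repeat`, `xor`, and `matches` are hypothetical helpers, and `call`, `future`, and `done` are not modeled.

```python
from typing import Callable, Sequence, Set

# A matcher takes the event list and a start index and returns every
# index at which a successful match could end.
Matcher = Callable[[Sequence[str], int], Set[int]]


def lit(name: str) -> Matcher:
    """Match exactly one event with the given name."""
    def match(events: Sequence[str], pos: int) -> Set[int]:
        return {pos + 1} if pos < len(events) and events[pos] == name else set()
    return match


def seq(*parts: Matcher) -> Matcher:
    """Match each part in order."""
    def match(events: Sequence[str], pos: int) -> Set[int]:
        frontier = {pos}
        for part in parts:
            frontier = {e for p in frontier for e in part(events, p)}
        return frontier
    return match


def xor(*branches: Matcher) -> Matcher:
    """Match any one of several blocks."""
    def match(events: Sequence[str], pos: int) -> Set[int]:
        ends: Set[int] = set()
        for branch in branches:
            ends |= branch(events, pos)
        return ends
    return match


def repeat(lo: int, hi: int, block: Matcher) -> Matcher:
    """Match n copies of block, with lo <= n <= hi."""
    def match(events: Sequence[str], pos: int) -> Set[int]:
        ends = {pos} if lo == 0 else set()
        frontier = {pos}
        for count in range(1, hi + 1):
            frontier = {e for p in frontier for e in block(events, p)}
            if not frontier:
                break
            if count >= lo:
                ends |= frontier
        return ends
    return match


def matches(recognizer: Matcher, events: Sequence[str]) -> bool:
    """A recognizer matches a path only if it consumes every event."""
    return len(events) in recognizer(events, 0)
```

For example, `seq(lit("recv"), repeat(1, 3, lit("query")), xor(lit("reply"), lit("error")))` accepts one to three queries followed by either outcome and rejects a path with four queries.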
![Page 11: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/11.jpg)
Expectations: Aggregate expectations
• Recognizers categorize paths into sets
• Aggregates make assertions about sets of paths
  – Count, unique count, resource constraints
  – Simple math and set operators

    assert(instances(CGIRequest) > 4);
    assert(max(CPU_TIME, CGIRequest) < 500ms);
    assert(max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest));
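The three aggregate assertions above can be mirrored over a list of already-categorized paths. The Python sketch below assumes a made-up record layout (`recognizer`, `cpu_ms`, `real_ms`) and made-up sample values; it only illustrates how count and max/avg constraints apply to the set of paths a recognizer selected.

```python
from statistics import mean
from typing import Dict, List

PathRecord = Dict[str, object]  # hypothetical layout: recognizer, cpu_ms, real_ms


def instances(paths: List[PathRecord], recognizer: str) -> int:
    return sum(1 for p in paths if p["recognizer"] == recognizer)


def metric(paths: List[PathRecord], recognizer: str, name: str) -> List[float]:
    return [p[name] for p in paths if p["recognizer"] == recognizer]


def check_aggregates(paths: List[PathRecord]) -> List[str]:
    """Return the text of every aggregate assertion that fails."""
    failed = []
    cpu = metric(paths, "CGIRequest", "cpu_ms")
    real = metric(paths, "CGIRequest", "real_ms")
    if not instances(paths, "CGIRequest") > 4:
        failed.append("instances(CGIRequest) > 4")
    if cpu and not max(cpu) < 500:
        failed.append("max(CPU_TIME, CGIRequest) < 500ms")
    if real and not max(real) <= 3 * mean(real):
        failed.append("max(REAL_TIME, CGIRequest) <= 3*avg(REAL_TIME, CGIRequest)")
    return failed


# Invented sample data: five CGIRequest paths.
SAMPLE = [
    {"recognizer": "CGIRequest", "cpu_ms": c, "real_ms": r}
    for c, r in [(20, 100), (35, 120), (50, 110), (40, 130), (30, 400)]
]
```

The last assertion is the interesting one: it bounds the slowest path relative to the average, so a single extreme outlier fails the check without anyone choosing an absolute threshold.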
![Page 12: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/12.jpg)
Outline
• Expressing expected behavior
• Building a model of actual behavior
• Exploring application behavior
• Results
![Page 13: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/13.jpg)
Building a behavior model
Sources of events:
• Annotations in source code
  – Programmer inserts statements manually
• Annotations in middleware
  – Middleware inserts annotations automatically
  – Faster and less error-prone
• Passive tracing or interposition
  – Easier, but less information
• Or any combination of the above
Model consists of paths constructed from events recorded by the running application
![Page 14: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/14.jpg)
Annotations
• Set path ID
• Start/end task
• Send/receive message
• Notice
[Figure: timeline across WWW, App server, and DB; task labels: "Parse HTTP", "Query", "Send response", "Run application"; axis: time; notices: “Request = /cgi/…”, “2096 bytes in response”, “done with request 12”]
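The four annotation kinds listed on this slide suggest a very small recording interface. The Python sketch below is hypothetical throughout: the class `Annotator`, its method names, and the event dictionary layout are assumptions made for illustration and are not Pip's annotation library. It keeps the current path ID per thread so that every recorded event is tagged with the path it belongs to.

```python
import threading
import time
from typing import Dict, List


class Annotator:
    """Hypothetical in-memory recorder for the four annotation kinds."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.events: List[Dict[str, object]] = []
        self._local = threading.local()  # current path ID is tracked per thread

    def set_path_id(self, path_id: str) -> None:
        self._local.path_id = path_id

    def _record(self, kind: str, name: str) -> None:
        self.events.append({
            "path_id": getattr(self._local, "path_id", None),
            "host": self.host,
            "ts": time.monotonic(),
            "kind": kind,
            "name": name,
        })

    def start_task(self, name: str) -> None:
        self._record("task_start", name)

    def end_task(self, name: str) -> None:
        self._record("task_end", name)

    def send(self, message_id: str) -> None:
        self._record("send", message_id)

    def receive(self, message_id: str) -> None:
        self._record("recv", message_id)

    def notice(self, text: str) -> None:
        self._record("notice", text)
```

A request handler would call `set_path_id` once when the request arrives, then bracket each unit of work with `start_task` and `end_task` and emit notices where it would otherwise log.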
![Page 15: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/15.jpg)
Automating expectations and annotations
• Expectations can be generated from behavior model
  – Create a recognizer for each actual path
  – Eliminate repetition
  – Strike a balance between over- and under-specification
• Annotations can be generated by middleware
• Automatic annotations in Mace, Sandstorm, J2EE, FAB
  – Several of our test systems use Mace annotations
[Figure: Application, Annotations, Behavior model, Expectations, Pip checker, Unexpected behavior]
![Page 16: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/16.jpg)
Checking expectations
[Figure: checker pipeline with stages Application, Traces, Reconciliation, Events database, Path construction, Paths, Expectation checking (input: Expectations), Categorized paths]
• Reconciliation: match start/end task, send/receive message
• Path construction: organize events into causal paths
• Expectation checking:

    For each path P
      For each recognizer R
        Does R match P?
    Check each aggregate
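The matching and path-organizing steps named on this slide can be sketched as a single pass over raw events. The Python below assumes a hypothetical event layout (`path_id`, `host`, `ts`, `kind`, `name`) and is not Pip's algorithm: it groups events by path ID, pairs task starts and ends with a per-host stack, pairs sends and receives by message ID, and reports anything left unpaired.

```python
from collections import defaultdict
from typing import Dict, List

Event = Dict[str, object]  # hypothetical keys: path_id, host, ts, kind, name


def build_paths(events: List[Event]) -> Dict[str, dict]:
    """Group events by path, then pair task and message events."""
    by_path: Dict[str, List[Event]] = defaultdict(list)
    for ev in events:
        by_path[ev["path_id"]].append(ev)

    paths = {}
    for path_id, evs in by_path.items():
        open_tasks = defaultdict(list)  # host -> stack of unmatched task starts
        sends, recvs = {}, {}
        tasks, unmatched = [], []
        # Timestamps are only compared between events on the same host,
        # so clock skew between hosts does not affect the pairing.
        for ev in sorted(evs, key=lambda e: e["ts"]):
            kind, name, host = ev["kind"], ev["name"], ev["host"]
            if kind == "task_start":
                open_tasks[host].append(ev)
            elif kind == "task_end":
                stack = open_tasks[host]
                if stack and stack[-1]["name"] == name:
                    start = stack.pop()
                    tasks.append((name, host, start["ts"], ev["ts"]))
                else:
                    unmatched.append(ev)
            elif kind == "send":
                sends[name] = ev
            elif kind == "recv":
                recvs[name] = ev
        for stack in open_tasks.values():
            unmatched.extend(stack)
        messages = [(m, sends[m]["host"], recvs[m]["host"])
                    for m in sends if m in recvs]
        unmatched.extend(ev for m, ev in sends.items() if m not in recvs)
        unmatched.extend(ev for m, ev in recvs.items() if m not in sends)
        paths[path_id] = {"tasks": tasks, "messages": messages,
                          "unmatched": unmatched}
    return paths
```

Unmatched events are kept rather than dropped, since a task that never ends or a message that is never received is itself a candidate for unexpected behavior.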
![Page 17: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/17.jpg)
Exploring behavior
• Expectations checker generates lists of valid and invalid paths
• Explore both sets
  – Why did invalid paths occur?
  – Is any unexpected behavior misclassified as valid?
    • Insufficiently constrained expectations
    • Pip may be unable to express all expectations
• Two ways to explore behavior
  – SQL queries over tables
    • Paths, threads, tasks, messages, notices
  – Visualization
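To make the "SQL queries over tables" option concrete, here is a sketch using Python's built-in `sqlite3` module against an invented schema. Only the table names `paths` and `tasks` come from the slide; every column name and every row is an assumption, and Pip's real schema may look quite different.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE paths (path_id INTEGER PRIMARY KEY, recognizer TEXT);
        CREATE TABLE tasks (
            path_id INTEGER REFERENCES paths(path_id),
            name TEXT, host TEXT, real_ms INTEGER, cpu_ms INTEGER);
    """)
    conn.executemany("INSERT INTO paths VALUES (?, ?)",
                     [(1, "CGIRequest"), (2, None)])
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?, ?)", [
        (1, "Parse HTTP", "WWW", 4, 2),
        (1, "Run application", "App server", 44, 10),
        (1, "Query", "DB", 28, 5),
        (2, "Parse HTTP", "WWW", 3, 1),
        (2, "Query", "DB", 400, 6),
    ])
    return conn


def slowest_tasks(conn: sqlite3.Connection, limit: int = 3):
    """Return the tasks with the largest real time, slowest first."""
    return conn.execute(
        "SELECT path_id, name, host, real_ms FROM tasks "
        "ORDER BY real_ms DESC LIMIT ?", (limit,)).fetchall()


def task_time_per_path(conn: sqlite3.Connection):
    """Sum task real time per path, alongside the recognizer it matched."""
    return conn.execute(
        "SELECT p.path_id, p.recognizer, SUM(t.real_ms) "
        "FROM paths AS p JOIN tasks AS t ON t.path_id = p.path_id "
        "GROUP BY p.path_id ORDER BY p.path_id").fetchall()
```

In this invented data, path 2 matched no recognizer and also contains the slowest task, which is the sort of correlation an ad hoc query can surface before anyone writes an expectation for it.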
![Page 18: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/18.jpg)
Visualization: causal paths
[Figure: Pip explorer screenshot with callouts: "Causal view of path"; "Caused tasks, messages, and notices on that thread"; "Timing and resource properties for one task"]
![Page 19: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/19.jpg)
Visualization: communication graph
• Graph view of all host-to-host network traffic
![Page 20: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/20.jpg)
Visualization: performance graphs
• Plot per-task or per-path resource metrics
  – Cumulative distribution (CDF), probability density (PDF), or vs. time
• Click on a point to see its value and the task/path represented
[Figure: plot of Delay (ms) vs. Time (s)]
![Page 21: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/21.jpg)
Pip vs. printf
• Both record interesting events to check off-line
  – Pip imposes structure and automates checking
  – Generalizes ad hoc approaches

| Pip | printf |
| --- | --- |
| Nesting, causal order | Unstructured |
| Time, path, and thread | No context |
| CPU and I/O data | No resource information |
| Automatic verification using declarative language | Verification with ad hoc grep or expect scripts |
| SQL queries | “Queries” using Perl scripts |
| Automatic generation for some middleware | Manual placement |
![Page 22: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/22.jpg)
Results
• We have applied Pip to several distributed systems:
  – FAB: distributed block store
  – SplitStream: DHT-based multicast protocol
  – RanSub: tree-based protocol used to build higher-level systems
  – Others: Bullet, SWORD, Oracle of Bacon
• We have found unexpected behavior in each system
• We have fixed bugs in some systems… and used Pip to verify that the behavior was fixed
![Page 23: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/23.jpg)
Results: SplitStream (DHT-based multicast protocol)
13 bugs found, 12 fixed
  – 11 found using expectations, 2 found using GUI
• Structural bug: some nodes have up to 25 children when they should have at most 18
  – This bug was fixed and later reoccurred
  – Root cause #1: variable shadowing
  – Root cause #2: failed to register a callback
• How discovered: first in the explorer GUI, confirmed with automated checking
![Page 24: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/24.jpg)
Results: FAB (distributed block store)
1 bug (so far), fixed
  – Four protocols checked: read, write, Paxos, membership
• Performance bug: nodes seeking quorum call self and peers in arbitrary order
  – Should call self last, to overlap computation
  – For cached blocks, should call self second-to-last
![Page 25: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/25.jpg)
Results: RanSub (tree-based protocol)
2 bugs found, 1 fixed
• Structural bug: during first round of communication, parent nodes send summary messages before hearing from all children
  – Root cause: uninitialized state variables
• Performance bug: linear increase in end-to-end delay for the first ~2 minutes
  – Suspected root cause: data structure listing all discovered nodes
![Page 26: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/26.jpg)
Future work
• Further automation of annotations, tracing
  – Explore tradeoffs between black-box, annotated behavior models
• Extensible annotations
  – Application-specific schema for notices
• Composable expectations for large systems
![Page 27: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/27.jpg)
Related work
• Expectations-based systems
  – PSpec [Perl, 1993]
  – Meta-level compilation [Engler, 2000]
  – Paradyn [Miller, 1995]
• Causal paths
  – Pinpoint [Chen, 2002]
  – Magpie [Barham, 2004]
  – Project5 [Aguilera, 2003]
• Model checking
  – MaceMC [Killian, 2006]
  – VeriSoft [Godefroid, 2005]
![Page 28: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/28.jpg)
Conclusions
• Finding unexpected behavior can help us find bugs
  – Both structure and performance bugs
• Expectations serve as a high-level external specification
  – Summary of inter-component behavior and timing
  – Regression test for structure and performance
• Some bugs not exposed by expectations can be found through exploring: queries and visualization
![Page 29: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/29.jpg)
Extra slides
![Page 30: Pip Detecting the Unexpected in Distributed Systems Janet Wiener Jeff Mogul Mehul Shah Chip Killian Amin](https://reader035.vdocuments.us/reader035/viewer/2022062503/5a4d1af17f8b9ab05997ea3b/html5/thumbnails/30.jpg)
Resource metrics
• Real time
• User time, system time
  – CPU time = user + system
  – Busy time = CPU time / real time
• Major and minor page faults (paging and allocation)
• Voluntary and involuntary context switches
• Message size and latency
• Number of messages sent
• Causal depth of path
• Number of threads, hosts in path
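The two derived metrics on this slide are simple arithmetic over the measured times. A minimal Python sketch, with a hypothetical `TaskUsage` record and invented sample numbers:

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    """Measured times for one task, in seconds (field names are assumptions)."""
    real_s: float
    user_s: float
    system_s: float

    @property
    def cpu_s(self) -> float:
        # CPU time = user + system
        return self.user_s + self.system_s

    @property
    def busy(self) -> float:
        # Busy time = CPU time / real time; defined as 0.0 for a zero-length task
        return self.cpu_s / self.real_s if self.real_s > 0 else 0.0


usage = TaskUsage(real_s=2.0, user_s=0.5, system_s=0.25)
```

Here `usage.cpu_s` is 0.75 seconds and `usage.busy` is 0.375, meaning the task spent most of its wall-clock time blocked or waiting rather than computing. On Unix, Python's `resource.getrusage` exposes user and system time, page fault counts, and context switch counts, so it would be one plausible source for such a record.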