Performance Debugging for
Distributed Systems of Black Boxes
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, HP Labs
Patrick Reynolds, Duke
Athicha Muthitacharoen, MIT
WISP 2004, 11 November 2004
page 2, 20 October 2003, Project5 - SOSP
Example multi-tier system
[Figure: diagram of clients, web servers, application servers, database servers, and an authentication server]
Motivation
• Complex distributed systems are built from black box components
• These systems may have performance problems
• High or erratic latency
• Locating the causes of these problems is hard
• We can’t always examine or modify system components
• We need tools to infer where bottlenecks are
• Choose which black boxes to open
Contributions of our work
• Tools to highlight which black boxes have problems
• Require only passive information, such as packet traces
• Infer from traces where most of the time is spent
• A person can then use more invasive tools to examine those boxes
• Reduce time and cost to debug complex systems
• Improve quality of delivered systems
Example causal path
[Figure: the same multi-tier diagram with one causal path highlighted and a 100ms label]
Goals of our tools
• Find high-impact causal paths through a distributed system
  – Causal path: series of nodes that sent/received messages
    – Each message is caused by receipt of previous message
    – Some causal paths occur many times
  – High impact:
    – Occurs frequently
    – Contributes significantly to overall latency
• Without modifications or semantic knowledge
• Report per-node latencies on causal paths
Overview of our approach
• Obtain traces of messages between components
  – Ethernet packets, middleware messages, etc.
  – Collect traces as non-invasively as possible
• Analyze traces using algorithms
• Visualize results and highlight high-impact paths
• Requires very little information: [timestamp, source, destination]
Outline
• Problem statement & goals
• Overview of our approach
• Algorithm
• Experimental results
• Related work
• Conclusions
The convolution algorithm: input
Time From To
0.01 A B
0.02 A B
0.04 B D
0.05 C F
...
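A minimal sketch of how trace rows like these might be turned into per-edge time signals for the convolution step. The function name `build_time_signals`, the `quantum` bucket width, and the choice to count messages per bucket are assumptions for illustration; the slides do not specify them.

```python
def build_time_signals(trace, quantum, duration):
    """Map each (source, destination) pair to a list of per-bucket message counts.

    trace    -- iterable of (timestamp, source, destination) tuples
    quantum  -- bucket width in the same unit as the timestamps (assumed)
    duration -- length of the trace in the same unit (assumed)
    """
    n_buckets = round(duration / quantum) + 1
    signals = {}
    for timestamp, src, dst in trace:
        # round() rather than int() so that 0.29 / 0.01 lands in bucket 29
        bucket = round(timestamp / quantum)
        if 0 <= bucket < n_buckets:
            signal = signals.setdefault((src, dst), [0] * n_buckets)
            signal[bucket] += 1
    return signals


# The four rows shown on the slide.
trace = [(0.01, "A", "B"), (0.02, "A", "B"), (0.04, "B", "D"), (0.05, "C", "F")]
signals = build_time_signals(trace, quantum=0.01, duration=0.05)
print(signals[("A", "B")])
```

Each signal is indexed by time bucket, so two signals built with the same `quantum` can be compared position by position.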
The convolution algorithm: output
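[Figure: tree of causal paths rooted at A; below A are nodes B, C and D, then nodes E and F, then leaves G. Edges carry delay labels such as .15, .10 and 0]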
Basic idea
• Creates a “time signal” for messages from each node
• Given time signals S1(t)=(A→B) and S2(t)=(B→X)
Computes convolution of S2(t) and S1(-t) = S1 * S2
(can be computed quickly using fast Fourier transforms)
[Figure: time signal S1(t) for A→B msgs, plotted against time ticks 1 to 7]
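As a hedged illustration of the FFT remark, the sketch below computes the correlation of two integer-valued time signals with a small radix-2 FFT written in pure Python. The function names, the zero-padding strategy, and the restriction to non-negative delays are assumptions made for this example, not details from the slides.

```python
import cmath


def fft(x, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def cross_correlate_fft(s1, s2):
    """Return c with c[d] = sum over t of s1[t] * s2[t + d], for d >= 0."""
    size = 1
    while size < len(s1) + len(s2):  # pad so circular wrap-around cannot occur
        size *= 2
    f1 = fft([complex(v) for v in s1] + [0j] * (size - len(s1)))
    f2 = fft([complex(v) for v in s2] + [0j] * (size - len(s2)))
    # Conjugating F1 corresponds to time-reversing the real signal s1.
    product = [a.conjugate() * b for a, b in zip(f1, f2)]
    c = fft(product, invert=True)
    # The inverse above is unnormalized, so divide by size; counts are integers.
    return [round(v.real / size) for v in c[:len(s2)]]


s1 = [1, 0, 1, 0, 0, 0, 1, 0]        # A→B messages at ticks 0, 2, 6
s2 = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]  # B→X messages at ticks 2, 4, 8
c = cross_correlate_fft(s1, s2)
print(c.index(max(c)))  # delay with the strongest correlation
```

In this toy input every B→X message follows an A→B message by two ticks, so the largest value of `c` sits at delay 2.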
S1(t)=(A→B msgs)
S2(t)=(B→X msgs)
S1 * S2=conv(S2(t), S1(-t))
• Spikes suggest causality between A→B and B→X msgs
• Time shift of a spike indicates its characteristic delay
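Written out in discrete time (an assumption; the slides give no explicit formula), the quantity labeled S1 * S2 on this slide can be read as:

```latex
(S_1 * S_2)(d) \;=\; \operatorname{conv}\bigl(S_2(t),\, S_1(-t)\bigr)(d)
\;=\; \sum_{t} S_2(t)\, S_1(t - d)
\;=\; \sum_{u} S_1(u)\, S_2(u + d)
```

If many B→X messages follow A→B messages by roughly the same delay d0, the products line up at d = d0 and the sum shows a spike there.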
Details: first step
• Choose starting node A
• Use trace to add edges from it
Time From To
0.01 A B
0.02 A B
0.04 A C
0.05 A B
[Figure: tree with root A and children B and C]
Continuing
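Time From To
… B D
… B E
… B F
… B G
[Figure: tree with root A and children B and C; the children of B are still unknown (??). Plots of (A→B)*(B→E) and (A→B)*(B→D) are shown, with a delay d marked]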
How
Time From To
t1 A B
t2 A B
t3 A B
t4 A B

Time From To
…
t1+d B D
…
t2+d B D
…
t3+d B D
t3+d B E
…
t4+d B D
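The two tables can be illustrated with a small sketch that counts how often each delay separates an A→B message from a later B→D or B→E message. This pairwise histogram is a simplified stand-in for the convolution, and the function name, the integer tick values, and `max_delay` are all assumptions chosen for the example.

```python
from collections import Counter


def delay_histogram(cause_times, effect_times, max_delay):
    """Count each non-negative delay (effect - cause) up to max_delay."""
    counts = Counter()
    for cause in cause_times:
        for effect in effect_times:
            delay = effect - cause
            if 0 <= delay <= max_delay:
                counts[delay] += 1
    return counts


d = 3                                   # hypothetical characteristic delay
a_to_b = [10, 25, 47, 80]               # t1..t4
b_to_d = [t + d for t in a_to_b]        # one B→D message after every A→B
b_to_e = [47 + d]                       # a single B→E message at t3+d

hist_d = delay_histogram(a_to_b, b_to_d, max_delay=20)
hist_e = delay_histogram(a_to_b, b_to_e, max_delay=20)
print(hist_d.most_common(1), hist_e.most_common(1))
```

Delay `d` accumulates one count per A→B message for B→D, while B→E reaches a count of only one, which is why B→D shows a spike and B→E does not.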
Heuristic to find spikes
threshold 1: n1 stddev over mean
threshold 2: n2 stddev over mean
n1 = 2
n2 = 1.5
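One possible reading of these thresholds, sketched below, is "a value counts as a spike if it exceeds the mean by n standard deviations". The slide does not say how the two thresholds are combined, so this example only applies each one separately; the function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def find_spikes(correlation, n_stddev):
    """Return the delays whose value exceeds mean + n_stddev * stddev."""
    threshold = mean(correlation) + n_stddev * pstdev(correlation)
    return [delay for delay, value in enumerate(correlation) if value > threshold]


# Hypothetical convolution output, indexed by delay.
correlation = [1, 0, 9, 1, 0, 7, 0, 1, 0, 1]
strong = find_spikes(correlation, n_stddev=2)    # n1 = 2
weak = find_spikes(correlation, n_stddev=1.5)    # n2 = 1.5
print(strong, weak)
```

With this input the stricter threshold keeps only the peak at delay 2, while the looser one also admits the smaller peak at delay 5.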
Recursing to continue
• Observations:
1. (B→D) are not all msgs from B to D (only those caused by A)
2. Stop recursion when too few messages left or no more spikes found
[Figure: path from A to B to D with a delay d marked; the children of D are still unknown (??)]
Outline
• Problem statement & goals
• Overview of our approach
• Algorithm
• Experimental results
• Conclusions
Results: email service delays
• Jeff logged all email headers for two months
• Parsed 80K Received headers in 12K messages
  Received: from cceexg11.americas.cpqcorp.net ...by wera.hpl.hp.com ... ; Fri, 4 Apr 2003 15:35:54 -0800
  – Yields (timestamp, sender, receiver) trace records
• Used Convolution Algorithm to
  – Reconstruct message paths
  – Find typical delays
• Note: this is NOT the most direct way to use email headers
  – We made the problem harder so as to test our algorithm
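A minimal sketch of how one Received header might be turned into a (timestamp, sender, receiver) record. The regular expression and the function name `parse_received` are assumptions for illustration and will not cover every header variant; the sample string is the header from the slide, elisions included.

```python
import re
from email.utils import parsedate_to_datetime

# Assumed shape: "from <host> ... by <host> ... ; <RFC 2822 date>"
RECEIVED_RE = re.compile(
    r"from\s+([^\s;]+).*?\bby\s+([^\s;]+).*;\s*(.+)$", re.DOTALL
)


def parse_received(header_value):
    """Return (timestamp, sender, receiver), or None if the header does not match."""
    match = RECEIVED_RE.search(header_value)
    if match is None:
        return None
    sender, receiver, date_text = match.groups()
    return parsedate_to_datetime(date_text.strip()), sender, receiver


header = ("from cceexg11.americas.cpqcorp.net ...by wera.hpl.hp.com ... ; "
          "Fri, 4 Apr 2003 15:35:54 -0800")
record = parse_received(header)
print(record)
```

The timestamp keeps its UTC offset, so records from hosts in different time zones can be compared after conversion to a common zone.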
Email trace: output
[Figure: tree of inferred email paths rooted at node 60, with nodes 39, 67 and 37 below it, then nodes 40 and 38, then leaves 41; the edge labels are not recoverable from this transcript]
Results: Petstore
• Sun’s demo application for J2EE
• Stanford’s PinPoint project provided us with traces
  – One trace has a node that is artificially slowed down
Future work
• Automate trace gathering and conversion
• Sliding-window versions of algorithms
  – Find phased behavior
  – Reduce memory usage of nesting algorithm
  – Improve speed of convolution algorithm
• Validate usefulness on more complicated systems
• What are limits of our approach?
Conclusion
• Looking for bottlenecks in black box systems
• Use signal processing techniques to find causal paths in the network and their delays
• For more information
  – http://www.hpl.hp.com/research/project5/
• Contact us if you have multi-hop message traces!