problem diagnosis distributed problem diagnosis sherlock x-trace
![Page 1: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/1.jpg)
Problem Diagnosis
• Distributed Problem Diagnosis
• Sherlock
• X-trace
![Page 2: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/2.jpg)
Troubleshooting Networked Systems
• Hard to develop, debug, deploy, troubleshoot
• No standard way to integrate debugging, monitoring, diagnostics
![Page 3: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/3.jpg)
Status quo: device centric
[Figure: each device keeps its own log]
Firewall: ...28 03:55:38 PM fire...
Load Balancer: ...[04:03:23 2006] [notice] Dispatch s1 ... [04:04:47 2006] [crit] Server s3 down...
Web 1: ...72.30.107.159 - - [20/Aug/2006:09:12:58 -0700] "GET /ga...
Web 2: ...72.30.107.159 - - [20/Aug/2006:09:12:58 -0700] "GET /ga...
Database: ...LOG: statement: select oid... LOG: statement: SELECT COU... LOG: statement: SELECT g2_...
![Page 4: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/4.jpg)
Status quo: device centric
• Determining paths:
– Join logs on time and ad-hoc identifiers
• Relies on
– well synchronized clocks
– extensive application knowledge
• Requires all operations logged to guarantee complete paths
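To make the fragility of joining logs on time concrete, here is a minimal sketch. The record layout, the `join_on_time` helper, the IP address, and the one-second tolerance are all assumptions made for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    device: str
    timestamp: float  # seconds, according to the device's own clock
    client_ip: str    # ad-hoc identifier that happens to appear in both logs


def join_on_time(upstream, downstream, tolerance=1.0):
    """Pair entries with the same client_ip whose timestamps differ by at most `tolerance`."""
    pairs = []
    for u in upstream:
        for d in downstream:
            same_client = u.client_ip == d.client_ip
            close_in_time = abs(d.timestamp - u.timestamp) <= tolerance
            if same_client and close_in_time:
                pairs.append((u, d))
    return pairs


# Hypothetical entries: the same request seen at the load balancer and at a web server.
lb = [LogEntry("load_balancer", 100.0, "10.0.0.1")]
web = [LogEntry("web1", 100.2, "10.0.0.1")]
# Same request again, but the web server's clock runs 5 seconds ahead.
skewed_web = [LogEntry("web1", 100.2 + 5.0, "10.0.0.1")]

matched = join_on_time(lb, web)        # the path is recovered
missed = join_on_time(lb, skewed_web)  # clock skew silently breaks the join
```

With synchronized clocks the request path is recovered; with a few seconds of skew the same request no longer joins, which is why this approach relies on well synchronized clocks.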
![Page 5: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/5.jpg)
Examples
User
DNS Server
Proxy
Web Server
![Page 6: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/6.jpg)
Examples
User
DNS Server
Proxy
Web Server
![Page 7: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/7.jpg)
Examples
User
DNS Server
Proxy
Web Server
![Page 8: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/8.jpg)
Examples
User
DNS Server
Proxy
Web Server
![Page 9: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/9.jpg)
Approaches to Diagnosis
• Passively learn the relationships
– Infer problems as deviations from the norm
• Actively instrument the stack to learn relationships
– Infer problems as deviations from the norm
![Page 10: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/10.jpg)
Sherlock – Diagnosing Problems in the Enterprise
Srikanth Kandula
![Page 11: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/11.jpg)
Well-Managed Enterprises Still Unreliable
[Figure: fraction of requests (0 to .1) against response time of a Web server (ms, log scale from 10 to 10000); 85% Normal, 10% Troubled, 0.7% Down]
10% of responses take up to 10x longer than normal
How do we manage evolving enterprise networks?
![Page 12: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/12.jpg)
Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems
Sherlock
![Page 13: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/13.jpg)
Challenges for the End-to-End Approach
• Don’t know what user’s performance depends on
![Page 14: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/14.jpg)
Challenges for the End-to-End Approach
• Don’t know what user’s performance depends on
– Dependencies are distributed
– Dependencies are non-deterministic
• Don’t know which dependency is causing the problem
– Server CPU 70%, link dropped 10 packets, but which affected user?
[Diagram, e.g., Web Connection: Client, DNS, Auth. Server, Web Server, SQL Backend]
![Page 15: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/15.jpg)
Sherlock’s Contributions
• Passively infers dependencies from logs
• Builds a unified dependency graph incorporating network, server and application dependencies
• Diagnoses user problems in the enterprise
• Deployed in a part of the Microsoft Enterprise
![Page 16: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/16.jpg)
Sherlock’s Architecture
![Page 17: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/17.jpg)
Sherlock’s Architecture
[Diagram: Clients and Servers]
User Observations (Web1 1000ms, Web2 30ms, File1 Timeout) + Network Dependency Graph → Inference Engine = List of Troubled Components
Sherlock works for various client-server applications
![Page 18: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/18.jpg)
Video Server
Data Store
DNS
How do you automatically learn such distributed dependencies?
![Page 19: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/19.jpg)
Strawman: Instrument all applications and libraries → Not Practical
Sherlock exploits timing info
[Timeline: My Client talks to B, then within time t My Client talks to C]
If the client talks to B whenever it talks to C → Dependent Connections
![Page 20: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/20.jpg)
Strawman: Instrument all applications and libraries → Not Practical
Sherlock exploits timing info
If the client talks to B whenever it talks to C → Dependent Connections
[Timeline: B is contacted many times, so one contact falls within time t of the access to C]
False Dependence
![Page 21: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/21.jpg)
Strawman: Instrument all applications and libraries → Not Practical
Sherlock exploits timing info
If the client talks to B whenever it talks to C → Dependent Connections
[Timeline: accesses to B are separated by the inter-access time; B and C occur within time t of each other]
Dependent iff t << Inter-access time
As long as this occurs with probability higher than chance
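The timing rule can be sketched in a few lines. This is an illustrative reading of the slide, not Sherlock's actual implementation: the Poisson model for "chance", the `margin` factor, and all function names are assumptions.

```python
import math
from bisect import bisect_left, bisect_right


def cooccurrence_fraction(b_times, c_times, window):
    """Fraction of accesses to C that have an access to B in the `window` seconds before them."""
    b_sorted = sorted(b_times)
    hits = 0
    for tc in c_times:
        lo = bisect_left(b_sorted, tc - window)
        hi = bisect_right(b_sorted, tc)
        if hi > lo:  # at least one B access in [tc - window, tc]
            hits += 1
    return hits / len(c_times) if c_times else 0.0


def chance_fraction(b_times, duration, window):
    """Probability that a random window contains a B access, assuming Poisson arrivals.

    `duration` (length of the trace in seconds) must be positive.
    """
    rate = len(b_times) / duration
    return 1.0 - math.exp(-rate * window)


def looks_dependent(b_times, c_times, duration, window, margin=2.0):
    """Flag a dependency only if co-occurrence is well above what chance would give."""
    observed = cooccurrence_fraction(b_times, c_times, window)
    return observed > margin * chance_fraction(b_times, duration, window)


# Hypothetical trace: C is accessed four times in 1000 s, each time just after B.
c_times = [10.0, 250.0, 610.0, 900.0]
b_times = [9.9, 249.95, 609.9, 899.9]
rare_b = looks_dependent(b_times, c_times, duration=1000.0, window=1.0)

# B contacted every 0.5 s: it always precedes C, but only by coincidence.
busy_b_times = [i * 0.5 for i in range(2000)]
busy_b = looks_dependent(busy_b_times, c_times, duration=1000.0, window=1.0)
```

In the first trace the window is much shorter than B's inter-access time, so the co-occurrence is flagged as a dependency. In the second, B is so frequent that co-occurrence is no better than chance and the false dependence is rejected.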
![Page 22: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/22.jpg)
Sherlock’s Algorithm to Infer Dependencies
Infer dependent connections from timing
[Diagram: Dependency Graph with nodes Video, DNS, Store]
![Page 23: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/23.jpg)
Sherlock’s Algorithm to Infer Dependencies
Infer dependent connections from timing
Infer topology from Traceroutes & configurations
[Diagram: Dependency Graph for "Bill Watches Video"; nodes: Bill’s Client, DNS, Video, Store; edges: Bill → DNS, Bill → Video, Video → Store]
• Works with legacy applications
• Adapts to changing conditions
![Page 24: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/24.jpg)
But hard dependencies are not enough…
![Page 25: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/25.jpg)
But hard dependencies are not enough…
[Diagram: "Bill watches Video"; nodes: Bill’s Client, DNS, Video, Store; edges: Bill → DNS, Bill → Video, Video → Store, labeled with probabilities p1, p2, p3]
If Bill caches server’s IP → DNS down but Bill gets video
Need Probabilities
Sherlock uses the frequency with which a dependence occurs in logs as its edge probability
p1=10%, p2=100%
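A small sketch of turning log frequencies into edge probabilities follows. The log format and the `edge_probabilities` helper are hypothetical, and the counts are chosen only to echo the 10% and 100% shown on the slide; which edge carries which value is an assumption.

```python
from collections import Counter


def edge_probabilities(accesses):
    """accesses: list of (service, set of dependencies contacted during that access).

    Returns {(service, dependency): fraction of accesses to `service`
    in which `dependency` was also contacted}.
    """
    totals = Counter()
    seen = Counter()
    for service, deps in accesses:
        totals[service] += 1
        for dep in deps:
            seen[(service, dep)] += 1
    return {edge: count / totals[edge[0]] for edge, count in seen.items()}


# Hypothetical log: Bill fetches the video 10 times but resolves DNS only once,
# because the server's IP is cached for the other 9 accesses.
log = [("Video", {"DNS", "Store"})] + [("Video", {"Store"})] * 9
probs = edge_probabilities(log)
```

Here `probs[("Video", "DNS")]` comes out as 0.1 and `probs[("Video", "Store")]` as 1.0, so a DNS outage would be expected to hurt only a small share of Bill's video accesses.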
![Page 26: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/26.jpg)
How do we use the dependency graph to diagnose user problems?
![Page 27: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/27.jpg)
Bill’s Client StoreDNS
Video Store
Video
Bill Watches Video
Bill DNS Bill Video
Which components caused the problem?
Need to disambiguate!!
Diagnosing User Problems
![Page 28: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/28.jpg)
Bill’s Client StoreDNS
Video Store
Video
Bill Watches Video
Bill DNS Bill Video
Diagnosing User Problems
Which components caused the problem?
Bill Sees Sales
Sales
Bill Sales
Paul Watches Video2
Paul Video2
Video2 Store
Video2
Use correlation to disambiguate!!
• Disambiguate by correlating
– Across logs from same client
– Across clients
• Prefer simpler explanations
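The idea of correlating across observations can be illustrated with a deliberately simplified, deterministic sketch that ignores edge probabilities. The dependency sets below are assumptions loosely based on the slide's labels, and `candidate_causes` is a hypothetical helper, not Sherlock's inference engine.

```python
def candidate_causes(observations, depends_on):
    """Return components shared by every troubled activity and used by no healthy one.

    observations: {activity: True if the user saw a problem, False otherwise}
    depends_on:   {activity: set of components the activity depends on}
    """
    troubled = [depends_on[a] for a, bad in observations.items() if bad]
    healthy = [depends_on[a] for a, bad in observations.items() if not bad]
    if not troubled:
        return set()
    suspects = set.intersection(*troubled)  # a single shared cause is the simplest explanation
    for deps in healthy:
        suspects -= deps                    # components seen working are exonerated
    return suspects


depends_on = {
    "Bill watches Video": {"Bill's client", "DNS", "Video", "Store"},
    "Bill sees Sales": {"Bill's client", "DNS", "Sales"},
    "Paul watches Video2": {"Paul's client", "DNS", "Video2", "Store"},
}
observations = {
    "Bill watches Video": True,    # troubled
    "Bill sees Sales": False,      # fine
    "Paul watches Video2": True,   # troubled
}
suspects = candidate_causes(observations, depends_on)
```

Bill's working Sales access clears his client and DNS, and Paul's troubled Video2 access points at what the two videos share, leaving Store as the only suspect.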
![Page 29: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/29.jpg)
Will Correlation Scale?
![Page 30: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/30.jpg)
Will Correlation Scale?
Microsoft Internal Network
• O(100,000) client desktops
• O(10,000) servers
• O(10,000) apps/services
• O(10,000) network devices
[Diagram: Building Network, Campus Core, Corporate Core, Data Center]
Dependency Graph is Huge
![Page 31: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/31.jpg)
Can we evaluate all combinations of component failures?
The number of fault combinations is exponential!
Impossible to compute!
Will Correlation Scale?
![Page 32: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/32.jpg)
Scalable Algorithm to Correlate
Only a few faults happen concurrently
But how many is few?
Evaluate enough to cover 99.9% of faults
For MS network, at most 2 concurrent faults → 99.9% accurate
Exponential → Polynomial
![Page 33: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/33.jpg)
Scalable Algorithm to Correlate
Only a few faults happen concurrently
But how many is few?
Evaluate enough to cover 99.9% of faults
For MS network, at most 2 concurrent faults → 99.9% accurate
Only few nodes change state
Exponential → Polynomial
![Page 34: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/34.jpg)
Scalable Algorithm to Correlate
Only a few faults happen concurrently
But how many is few?
Evaluate enough to cover 99.9% of faults
For MS network, at most 2 concurrent faults → 99.9% accurate
Only few nodes change state
Re-evaluate only if an ancestor changes state
Reduces the cost of evaluating a case by 30x-70x
Exponential → Polynomial
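The move from exponential to polynomial can be checked with a short calculation. This sketch only counts and enumerates fault hypotheses; it does not score them. The figure of 358 components is the number of components that can fail reported in the results slides, and the helper names are assumptions.

```python
import math
from itertools import combinations


def count_hypotheses(n_components, max_faults):
    """Number of fault sets containing at most `max_faults` failed components."""
    return sum(math.comb(n_components, k) for k in range(max_faults + 1))


def hypotheses(components, max_faults):
    """Yield every set of at most `max_faults` simultaneously failed components."""
    for k in range(max_faults + 1):
        yield from combinations(components, k)


n = 358                          # components that can fail
bounded = count_hypotheses(n, 2) # at most 2 concurrent faults: O(n^2) cases
unbounded = 2 ** n               # every subset of components: exponential

small = list(hypotheses(["DNS", "Video", "Store"], 2))
```

With at most two concurrent faults there are 64,262 hypotheses to evaluate for 358 components, compared with 2^358 if every combination were considered.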
![Page 35: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/35.jpg)
Results
![Page 36: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/36.jpg)
Experimental Setup
• Evaluated on the Microsoft enterprise network
• Monitored 23 clients, 40 production servers for 3 weeks
– Clients are at MSR Redmond
– Extra host on server’s Ethernet logs packets
• Busy, operational network
– Main Intranet Web site and software distribution file server
– Load-balancing front-ends
– Many paths to the data-center
![Page 37: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/37.jpg)
What Do Web Dependencies in the MS Enterprise Look Like?
![Page 38: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/38.jpg)
Auth. Server
What Do Web Dependencies in the MS Enterprise Look Like?
Client Accesses Portal
![Page 39: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/39.jpg)
Auth. Server
What Do Web Dependencies in the MS Enterprise Look Like?
Client Accesses Portal
![Page 40: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/40.jpg)
Auth. Server
Sherlock discovers complex dependencies of real apps.
What Do Web Dependencies in the MS Enterprise Look Like?
Client Accesses Portal Client Accesses Sales
![Page 41: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/41.jpg)
What Do File-Server Dependencies Look Like?
Client Accesses Software Distribution Server
[Diagram: dependency graph with nodes Auth. Server, WINS, DNS, Proxy, File Server, and Backend Servers 1 to 4; edge probabilities shown: 100%, 10%, 6%, 5%, 2%, 8%, 5%, 1%, .3%]
Sherlock works for many client-server applications
![Page 42: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/42.jpg)
Sherlock Identifies Causes of Poor Performance
Dependency Graph: 2565 nodes; 358 components that can fail
[Plot: Component Index against Time (days)]
87% of problems localized to 16 components
![Page 43: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/43.jpg)
Sherlock Identifies Causes of Poor Performance
Inference Graph: 2565 nodes; 358 components that can fail
Corroborated the three significant faults
[Plot: Component Index against Time (days)]
![Page 44: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/44.jpg)
Sherlock Goes Beyond Traditional Tools
• SNMP-reported utilization on a link flagged by Sherlock
• Problems coincide with spikes
Sherlock identifies the troubled link but SNMP cannot!
![Page 45: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/45.jpg)
![Page 46: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/46.jpg)
X-Trace
• X-Trace records events in a distributed execution and their causal relationship
• Events are grouped into tasks
– Well defined starting event and all that is causally related
• Each event generates a report, binding it to one or more preceding events
• Captures full happens-before relation
![Page 47: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/47.jpg)
X-Trace Output
• Task graph capturing task execution
– Nodes: events across layers, devices
– Edges: causal relations between events
[Diagram: task graph spanning HTTP Client, HTTP Proxy, and HTTP Server events; TCP 1 Start/End and TCP 2 Start/End events; IP and IP Router events]
![Page 48: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/48.jpg)
Basic Mechanism
• Each event uniquely identified within a task: [TaskId, EventId]
• [TaskId, EventId] propagated along execution path
• For each event create and log an X-Trace report
– Enough info to reconstruct the task graph
[Diagram: the HTTP / TCP / IP task graph with events labeled a through n; metadata [T, a] and [T, g] carried along the edges]
X-Trace Report
TaskID: T
EventID: g
Edge: from a, f
![Page 49: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/49.jpg)
X-Trace Library API
• Handles propagation within app
• Threads / event-based (e.g., libasync)
• Akin to a logging API:
– Main call is logEvent(message)
• Library takes care of event id creation, binding, reporting, etc.
• Implementations in C++, Java, Ruby, JavaScript
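To show what "akin to a logging API" can mean in practice, here is a toy, in-process sketch. It is not the real X-Trace library: the `Tracer` class, `log_event` method, report dictionary layout, and integer event ids are all assumptions made for illustration.

```python
import itertools
import threading


class Tracer:
    """Toy stand-in for an X-Trace-style library (hypothetical, single process)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.reports = []                # stand-in for a report collector
        self._ids = itertools.count(1)   # event id creation
        self._local = threading.local()  # last event logged by each thread

    def log_event(self, message, extra_parents=()):
        """Create an event, bind it to preceding events, and record a report."""
        event_id = next(self._ids)
        parents = list(extra_parents)
        previous = getattr(self._local, "last", None)
        if previous is not None:
            parents.insert(0, previous)  # bind to this thread's previous event
        self.reports.append({
            "task": self.task_id,
            "event": event_id,
            "edges_from": parents,
            "message": message,
        })
        self._local.last = event_id      # propagate within the application
        return event_id


tracer = Tracer("T")
a = tracer.log_event("HTTP client sends request")
b = tracer.log_event("TCP 1 start")
```

The caller only supplies a message; the id, the edge to the preceding event, and the report are handled inside the library, which is the division of labor the slide describes.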
![Page 50: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/50.jpg)
Task Tree
• X-Trace tags all network operations resulting from a particular task with the same task identifier
• Task tree is the set of network operations connected with an initial task
• The task tree can be reconstructed after collecting the trace data from the reports
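Reconstruction from reports can be sketched as grouping reports by task and following their edges. The report layout and the `build_task_graph` helper are hypothetical; only the report for event g (edges from a and f) mirrors the example in the slides, and the other edges are assumed.

```python
from collections import defaultdict


def build_task_graph(reports, task_id):
    """Return (roots, children) for one task from a pile of collected reports."""
    children = defaultdict(list)
    events = set()
    for report in reports:
        if report["task"] != task_id:
            continue  # reports from other tasks share the same collector
        events.add(report["event"])
        for parent in report["edges_from"]:
            children[parent].append(report["event"])
    has_parent = {child for kids in children.values() for child in kids}
    roots = sorted(events - has_parent)
    return roots, dict(children)


# Reports may arrive in any order and interleaved with other tasks.
reports = [
    {"task": "T", "event": "g", "edges_from": ["a", "f"]},
    {"task": "T", "event": "a", "edges_from": []},
    {"task": "T", "event": "f", "edges_from": ["a"]},  # assumed edge
    {"task": "U", "event": "x", "edges_from": []},
]
roots, children = build_task_graph(reports, "T")
```

Because every report names its task and its preceding events, arrival order does not matter: the starting event is simply the one with no incoming edge.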
![Page 51: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/51.jpg)
An example of the task tree
• A simple HTTP request through a proxy
![Page 52: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/52.jpg)
X-Trace Components
• Data
– X-Trace metadata
• Network path
– Task tree
• Report
– Reconstruct task tree
![Page 53: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/53.jpg)
Propagation of X-Trace Metadata
• The propagation of X-Trace metadata through the task tree
![Page 54: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/54.jpg)
Propagation of X-Trace Metadata
• The propagation of X-Trace metadata through the task tree
![Page 55: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/55.jpg)
The X-Trace metadata
| Field | Usage |
| --- | --- |
| Flags | Bits that specify which of the three optional components are present |
| TaskID | A unique integer ID |
| TreeInfo | ParentID, OpID, EdgeType |
| Destination | Specifies the address that X-Trace reports should be sent to |
| Options | Mechanism to accommodate future extensions |
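The fields above can be modeled as a small record to see how they travel with a request. This is a sketch under stated assumptions: the Python types, the flag bit order, the `next_hop` helper, the edge type string, and the destination address are invented for illustration and do not describe the actual wire format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TreeInfo:
    parent_id: int
    op_id: int
    edge_type: str  # representation as a string is an assumption


@dataclass
class XTraceMetadata:
    task_id: int
    tree_info: Optional[TreeInfo] = None   # optional component 1
    destination: Optional[str] = None      # optional component 2
    options: dict = field(default_factory=dict)  # optional component 3

    def flags(self):
        """Bit i is set when the i-th optional component is present (bit order assumed)."""
        present = [
            self.tree_info is not None,
            self.destination is not None,
            bool(self.options),
        ]
        return sum(1 << i for i, is_present in enumerate(present) if is_present)

    def next_hop(self, new_op_id, edge_type="next"):
        """Metadata for the next operation: same task, current operation becomes the parent."""
        parent = self.tree_info.op_id if self.tree_info else 0
        return XTraceMetadata(
            self.task_id,
            TreeInfo(parent, new_op_id, edge_type),
            self.destination,
            dict(self.options),
        )


root = XTraceMetadata(task_id=42, tree_info=TreeInfo(0, 1, "next"))
child = root.next_hop(new_op_id=2)
with_destination = XTraceMetadata(
    task_id=42, tree_info=TreeInfo(0, 1, "next"), destination="collector.example:7831"
)
```

The task id stays fixed across hops while the parent and operation ids change, which is what lets reports from different devices be stitched back into one tree.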
![Page 56: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/56.jpg)
X-Trace Report Architecture
![Page 57: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/57.jpg)
X-Trace Report Architecture
![Page 58: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/58.jpg)
X-Trace Report Architecture
![Page 59: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/59.jpg)
X-Trace-like in Google/Bing/Yahoo
• Why?
– Own large portion of the ecosystem
– Use RPC for communication
– Need to understand
  • Time for user request
  • Resource utilization by request
![Page 60: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/60.jpg)
Sherlock vs. X-Trace
• Overhead vs. accuracy
• Deployment issues
– Invasiveness
– Code modification
![Page 61: Problem Diagnosis Distributed Problem Diagnosis Sherlock X-trace](https://reader034.vdocuments.us/reader034/viewer/2022051116/56649eb45503460f94bbc165/html5/thumbnails/61.jpg)
Conclusions
• Sherlock passively infers network-wide dependencies from logs and traceroutes
• It diagnoses faults by correlating user observations
• X-Trace actively discovers network-wide dependencies by instrumenting the stack