Distributed Debugging
Presenter: Chi-Hung Lu
Problems
  Distributed applications are hard to validate
  Application state is distributed across many distinct execution environments
  Protocols involve complex interactions among a collection of networked machines
  Need to handle failures ranging from network problems to crashing nodes
  Intricate sequences of events can trigger complex errors as a result of mishandled corner cases

Approaches
  Logging-based Debugging
    X-Trace
    Bi-directional Distributed BackTracker (BDB)
    Pip
  Deterministic Replay
    WiDS
    Friday
    Jockey
  Model Checking
    MaceMC

X-Trace: A Pervasive Network Tracing Framework
R. Fonseca et al., NSDI 07

Problem Description
  It is difficult to diagnose the source of a problem in an Internet application
  Current network diagnostic tools each focus on one particular protocol
  They do not share information about the application between the user, the service, and the network operators

Examples
  traceroute
    Can locate IP connectivity problems
    Cannot reveal proxy or DNS failures
  HTTP monitoring suite
    Can locate application problems
    Cannot diagnose routing problems

Examples
  Diagram nodes: User, DNS Server, Proxy, Web Server

X-Trace
  An integrated tracing framework
  Records the network paths that were taken
  Invoke X-Trace when initiating an application task
  Insert X-Trace metadata with a task identifier in the request
  Propagate the metadata down to lower layers through protocol interfaces
Task Tree
  X-Trace tags all network operations resulting from a particular task with the same task identifier
  The task tree is the set of network operations connected with an initial task
  The task tree can be reconstructed after collecting trace data from reports

An example of the task tree
  A simple HTTP request through a proxy
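The reconstruction step can be sketched in a few lines of Python. This is a minimal illustration, not X-Trace's collector: the `Report` class, its field names, and the edge type values `NEXT` and `DOWN` are assumptions made for the example. It only shows how reports that share a task identifier can be linked parent to child to recover a tree for an HTTP request through a proxy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    """One report from an instrumented node (hypothetical field names)."""
    task_id: str
    op_id: str
    parent_id: Optional[str]   # None marks the root operation of the task
    edge_type: Optional[str]   # assumed values: "NEXT" (same layer), "DOWN" (lower layer)
    label: str


def build_task_trees(reports):
    """Group reports by task ID and index each operation under its parent."""
    roots = {}
    children = defaultdict(lambda: defaultdict(list))
    for r in reports:
        if r.parent_id is None:
            roots[r.task_id] = r
        else:
            children[r.task_id][r.parent_id].append(r)
    return roots, children


def render(task_id, roots, children):
    """Return the task tree as indented text lines, depth first."""
    lines = []

    def walk(report, depth):
        edge = f" [{report.edge_type}]" if report.edge_type else ""
        lines.append("  " * depth + report.label + edge)
        for child in children[task_id][report.op_id]:
            walk(child, depth + 1)

    walk(roots[task_id], 0)
    return lines


# Reports may reach the collector in any order.
reports = [
    Report("t1", "c", "b", "NEXT", "HTTP server"),
    Report("t1", "a", None, None, "HTTP client"),
    Report("t1", "b", "a", "NEXT", "HTTP proxy"),
    Report("t1", "d", "a", "DOWN", "TCP client to proxy"),
]
roots, children = build_task_trees(reports)
tree_lines = render("t1", roots, children)
print("\n".join(tree_lines))
```

Because linking relies only on the task, operation, and parent identifiers, the arrival order of reports does not matter in this sketch.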
X-Trace Components
  Data
    X-Trace metadata
    Network path
    Task tree
  Report
    Reconstruct task tree

Propagation of X-Trace Metadata
  The propagation of X-Trace metadata through the task tree
The X-Trace Metadata
  Field        Usage
  Flags        Bits that specify which of the three optional components are present
  TaskID       A unique integer ID
  TreeInfo     ParentID, OpID, EdgeType
  Destination  Specifies the address that X-Trace reports should be sent to
  Options      Mechanism to accommodate future extensions
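As a rough illustration of the fields listed above, the metadata could be modeled as a small Python record in which the flags are derived from whichever optional components are present. The bit values, type choices, and class names below are assumptions for the sketch and do not reflect the actual X-Trace wire format.

```python
from dataclasses import dataclass
from enum import IntFlag
from typing import Optional


class Flags(IntFlag):
    """One bit per optional component (bit positions are assumed)."""
    TREE_INFO = 0x1
    DESTINATION = 0x2
    OPTIONS = 0x4


@dataclass
class TreeInfo:
    parent_id: int
    op_id: int
    edge_type: str


@dataclass
class XTraceMetadata:
    task_id: int                            # always present
    tree_info: Optional[TreeInfo] = None    # optional component
    destination: Optional[str] = None       # optional: where reports are sent
    options: Optional[bytes] = None         # optional: future extensions

    @property
    def flags(self) -> Flags:
        """Compute the flag bits from the components that are set."""
        f = Flags(0)
        if self.tree_info is not None:
            f |= Flags.TREE_INFO
        if self.destination is not None:
            f |= Flags.DESTINATION
        if self.options is not None:
            f |= Flags.OPTIONS
        return f


bare = XTraceMetadata(task_id=42)
full = XTraceMetadata(
    task_id=42,
    tree_info=TreeInfo(parent_id=0, op_id=1, edge_type="NEXT"),
    destination="reports.example.org:7831",  # hypothetical report address
)
print(bare.flags, full.flags)
```

Deriving the flags from the fields, instead of storing them separately, keeps the two from drifting apart in this sketch.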
Operation of X-Trace Metadata
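One way to picture how metadata is operated on as a task crosses nodes and layers is sketched below. The `Tracer` class and the `push_next` and `push_down` method names are illustrative choices for this example, and the sequential operation IDs are an assumption made for readability. The point is only that every derived operation keeps the task ID, records its parent's operation ID, and is marked with an edge type.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metadata:
    task_id: int
    parent_id: Optional[int]
    op_id: int
    edge_type: Optional[str]


class Tracer:
    """Hands out operation IDs and derives child metadata (hypothetical helper)."""

    def __init__(self):
        self._next_op = itertools.count(1)

    def start_task(self, task_id):
        # The initiating operation has no parent and no incoming edge.
        return Metadata(task_id, None, next(self._next_op), None)

    def push_next(self, md):
        # Same-layer propagation, for example client to proxy.
        return Metadata(md.task_id, md.op_id, next(self._next_op), "NEXT")

    def push_down(self, md):
        # Propagation to the layer below, for example HTTP to TCP.
        return Metadata(md.task_id, md.op_id, next(self._next_op), "DOWN")


tracer = Tracer()
http_client = tracer.start_task(task_id=7)
tcp_1 = tracer.push_down(http_client)    # client's TCP connection
http_proxy = tracer.push_next(http_client)
tcp_2 = tracer.push_down(http_proxy)     # proxy's TCP connection
for md in (http_client, tcp_1, http_proxy, tcp_2):
    print(md)
```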
X-Trace Report Architecture
Usage Scenario (1)
  Web request and recursive DNS queries

Usage Scenario (2)
  A request fault annotated with user input

Usage Scenario (3)
  A client and a server communicate over the I3 overlay network
  Internet Indirection Infrastructure (I3)
  Tree for normal operation
  The receiver host fails
  Middlebox process crash
  The middlebox host fails
Discussion
  Report loss
  Non-tree request structures
  Partial deployment
  Managing report traffic
  Security considerations

WiDS Checker: Combating Bugs in Distributed Systems
X. Liu et al., NSDI 07

Problem Description
  Log mining is both labor-intensive and fragile
  Latent bugs are often distributed across multiple nodes
  Logs reflect incomplete information about an execution
  Non-determinism of distributed applications
Goals
  Efficiently verify application properties
  Provide fairly complete information about an execution
  Reproduce the buggy runs deterministically and faithfully

Approach
  Log the actual execution of a distributed system
  Apply predicate checking in a centralized simulator over a run driven by testing scripts or replayed from logs
  Output a violation report along with message traces
  An execution is interpreted as a sequence of events, which are dispatched to corresponding handling routines

Components
  A versatile script language
    Allows a developer to refine system properties into straightforward assertions
  A checker
    Inspects for violations

Architecture
  Components of WiDS Checker
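The idea of treating an execution as a sequence of events dispatched to handling routines can be sketched as a tiny single-threaded event loop. This is not the WiDS API: `EventLoop`, `on`, `post`, and `run` are hypothetical names for the illustration. With one queue and a fixed dispatch order, the same input events always produce the same handler sequence, which is the property that makes replay and checking tractable.

```python
from collections import deque


class EventLoop:
    """Buffers pending events and dispatches them in FIFO order (illustrative)."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def on(self, kind, handler):
        self.handlers[kind] = handler

    def post(self, kind, payload):
        self.queue.append((kind, payload))

    def run(self):
        while self.queue:
            kind, payload = self.queue.popleft()
            self.handlers[kind](payload)


loop = EventLoop()
trace = []


def on_message(payload):
    trace.append(("message", payload))
    # Handling a message schedules a follow-up timer event.
    loop.post("timer", payload + "-timeout")


def on_timer(payload):
    trace.append(("timer", payload))


loop.on("message", on_message)
loop.on("timer", on_timer)
loop.post("message", "m1")
loop.post("message", "m2")
loop.run()
print(trace)
```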
Architecture
  Reproduce real runs
    Log all non-deterministic events using Lamport's logical clock
  Check user-defined predicates
    A versatile scripting language to specify the system states being observed and the predicates for invariants and correctness
  Screen out false alarms with auxiliary information
    For liveness properties
  Trace root causes using a visualization tool

Programming with WiDS
  WiDS APIs are mostly member functions of the WiDSObject class
  The WiDS runtime maintains an event queue to buffer pending events and dispatches them to corresponding handling routines

Enabling Replay
  Logging
    Log all WiDS nondeterminism
    Redirect OS calls and log the results
    Embed a Lamport clock in each outgoing message
  Checkpoint
    Support partial replay
    Save the WiDS process context
  Replay
    Start from the beginning or a checkpoint
    Replay events in serialized Lamport order
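The logging and replay steps can be illustrated with a small Lamport clock sketch. It assumes a simplified model in which each node keeps its own log and the timestamp travels inside the message. The `Node` class, the log tuple layout, and the tie-break by node name are choices made for this example, not details of WiDS.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    clock: int = 0
    log: list = field(default_factory=list)

    def _record(self, kind, payload):
        self.log.append((self.clock, self.name, kind, payload))

    def local_event(self, payload):
        self.clock += 1
        self._record("local", payload)

    def send(self, payload):
        self.clock += 1
        self._record("send", payload)
        return (self.clock, payload)   # Lamport timestamp embedded in the message

    def receive(self, message):
        timestamp, payload = message
        # Standard Lamport rule: advance past both the local and the sender's clock.
        self.clock = max(self.clock, timestamp) + 1
        self._record("recv", payload)


def replay_order(*logs):
    """Merge per-node logs into one serialized order: Lamport time, then node name."""
    merged = [entry for log in logs for entry in log]
    return sorted(merged, key=lambda e: (e[0], e[1]))


a, b = Node("A"), Node("B")
a.local_event("init")
ping = a.send("ping")
b.local_event("init")
b.receive(ping)
pong = b.send("pong")
a.receive(pong)

schedule = replay_order(a.log, b.log)
for entry in schedule:
    print(entry)
```

In the merged schedule every send appears before its matching receive, so a replayer that processes events in this order never delivers a message before it was sent.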
Checker
  Observe memory state
  Define states and evaluate predicates
    Refresh database for each event
    Maintain history
    Re-evaluate modified predicates
  Auxiliary information for violations
    Liveness properties are only guaranteed to hold eventually
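A minimal sketch of the checker loop follows, assuming observed state is a flat key-value store and each predicate declares which keys it reads. The `Checker` class and its methods are hypothetical and much simpler than the WiDS scripting language. The sketch shows the three steps named above: refresh the state on each event, keep a history, and re-evaluate only the predicates whose inputs changed.

```python
class Checker:
    def __init__(self):
        self.state = {}        # database of observed state: key -> value
        self.history = []      # every update applied so far
        self.predicates = []   # (name, keys read, function over state)
        self.violations = []   # (event number, predicate name)
        self.evaluations = 0   # how many predicate evaluations were run

    def add_predicate(self, name, keys, fn):
        self.predicates.append((name, frozenset(keys), fn))

    def on_event(self, updates):
        """Apply one event's state updates, then re-check affected predicates."""
        self.history.append(dict(updates))
        self.state.update(updates)
        event_number = len(self.history)
        for name, keys, fn in self.predicates:
            if keys.isdisjoint(updates):
                continue   # none of this predicate's inputs changed
            self.evaluations += 1
            if not fn(self.state):
                self.violations.append((event_number, name))


LOCK_KEYS = ("A.holds_lock", "B.holds_lock")

checker = Checker()
checker.add_predicate(
    "single_lock_holder",
    LOCK_KEYS,
    lambda s: sum(bool(s.get(k)) for k in LOCK_KEYS) <= 1,
)

checker.on_event({"A.holds_lock": True})
checker.on_event({"B.queue_len": 3})       # unrelated state, predicate skipped
checker.on_event({"B.holds_lock": True})   # both nodes now hold the lock
checker.on_event({"A.holds_lock": False})
print(checker.violations)
```

The example predicate is a safety property, so a single bad state is a violation. A liveness property would need the extra auxiliary information mentioned above, because a state where it does not yet hold is not necessarily a bug.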
Visualization Tools
  Message flow graph

Evaluation
  Benchmark and result summary

Performance
  Running time for evaluating predicates

Logging Overhead
  Percentage of logging time
Discussion
  System is debugged by those who developed it
  Bugs are hunted by those who are intimately familiar with the system