Towards Highly Reliable Enterprise Network Services via Inference of
Multi-level Dependencies
Defense by
Chen, Jiazhen & Chen, Shiqi
Introduction
• Using a network-based service can be a frustrating experience…
Even inside the network of a single enterprise, where traffic does not cross the open Internet, user-perceptible service degradations are rampant.
Introduction
When 13% of requests take 10x longer than the normal, acceptable time to process, it's time for us to do something.
Introduction
• Conventional approach: up OR down.
• Our approach: up, troubled, or down.
Troubled: responses fall significantly outside the normal response time, including the case when only a subset of service requests perform poorly.
Sherlock
• Detects the existence of faults and performance problems by monitoring the response time of services.
• Determines the set of components that could be responsible.
• Localizes the problem to the most likely component.
Inference Graph and Ferret Algorithm
• Inference Graph: Captures the dependencies between all components of the IT infrastructure.
Determines the set of components that could be responsible
• Ferret Algorithm: Efficiently localizes faults in enterprise-scale networks using the Inference graph and measurements of service response times made by agents.
Localizes the problem to the most likely component
Tradeoffs
• Sherlock falls short of full problem diagnosis.
• It has not been evaluated in an environment where dependencies are deliberately and frequently changed.
• …
• BUT, measurements indicate that the vast majority of enterprise applications do not fall into these classes.
Inference Graph in detail
Node structure
• Root-cause node: physical components whose failure can cause an end-user to experience failures. (Two special root-cause nodes, AT "always troubled" and AD "always down", capture causes not modeled elsewhere.)
• Observation node: accesses to network services whose performance can be measured by Sherlock.
• Meta-nodes: model the dependencies between root causes and observations.
Node State
• The state of each node is expressed by a three-tuple: (Pup, Ptrouble, Pdown)
• Pup + Ptrouble + Pdown = 1
• The state of the root-cause node is independent of any other node in the Inference Graph.
Edges
• Edge from node A to node B: encodes the dependency that A must be in the up state for B to be up.
• Each edge is given a dependency probability, indicating how strong the dependency is.
Meta-Nodes and the Propagation of State
Noisy-Max Meta-Nodes
• Max: The node gets the worst condition of its parents.
• Noisy: if the weight of a parent’s edge is d, then with probability (1-d) the child is not affected by that parent.
Meta-Nodes and the Propagation of State
Selector Meta-Nodes
• Used to model load-balancing scenarios.
• NLB: Network Load Balancer
• ECMP (Equal-Cost Multi-Path): a commonly used technique in enterprise networks where routers send packets to a destination along several equal-cost paths.
Meta-Nodes and the Propagation of State
Failover Meta-Nodes
• Failover: Clients access primary production servers and failover to backup servers when the primary server is inaccessible.
Calculation of State Propagation
For Noisy-Max Meta-Nodes:
Time Complexity: O(n) instead of O(3^n) for n parents.
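The slide's formula is not reproduced in this transcript; as a sketch of how the O(n) computation can be implemented under the noisy-max semantics described above (function and variable names are mine, not the paper's):

```python
def noisy_max(parents):
    """Propagate state through a noisy-max meta-node in O(n).

    parents: list of ((p_up, p_troubled, p_down), d) pairs, where d is
    the dependency probability on the edge from that parent.
    """
    p_up = 1.0        # child is up iff no parent's bad state propagates
    p_not_down = 1.0  # child is down iff some parent's down state propagates
    for (up, troubled, down), d in parents:
        # With probability (1 - d) the child ignores this parent entirely.
        p_up *= up + (1 - d) * (troubled + down)
        p_not_down *= 1 - d * down
    p_down = 1 - p_not_down
    return (p_up, 1 - p_up - p_down, p_down)
```

For example, a single fully-dependent parent (d = 1) that is down drives the child fully down, while a half-weight (d = 0.5) down parent leaves the child up half the time.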
Calculation of State Propagation
For Selector and Failover meta-nodes:
• Still needs O(3^n) time.
• HOWEVER, those two types of meta-nodes have no more than 6 parents.
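Because these meta-nodes have at most 6 parents, the O(3^n) enumeration stays cheap. A sketch of that brute-force propagation in Python (the `failover_child` truth table below is my own illustrative failover semantics, not the paper's exact table):

```python
from itertools import product

STATES = ("up", "troubled", "down")

def propagate_by_enumeration(parents, child_state):
    """Brute-force O(3^n) state propagation for selector/failover meta-nodes.

    parents: list of dicts mapping state -> probability.
    child_state: the meta-node's truth table, mapping a tuple of
    parent states to the child's deterministic state.
    """
    out = {"up": 0.0, "troubled": 0.0, "down": 0.0}
    for assignment in product(STATES, repeat=len(parents)):
        p = 1.0
        for parent, state in zip(parents, assignment):
            p *= parent[state]  # parents assumed independent
        out[child_state(assignment)] += p
    return out

def failover_child(states):
    """Illustrative failover table (an assumption): parent 0 is the primary.
    Up if the primary is up; down only if every server is down;
    otherwise troubled (failover succeeded but added delay)."""
    if states[0] == "up":
        return "up"
    if all(s == "down" for s in states):
        return "down"
    return "troubled"
```

With a down primary and an up backup, this yields a fully troubled child, matching the intuition that failover keeps the service alive but degraded.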
Fault Localization
• Assignment-vector: An assignment of state to every root-cause node in the Inference Graph where the root-cause node has probability 1 of being either up, troubled or down.
• Our goal: find the assignment vector that best explains the observations.
Ferret Algorithm
• For each vector, compute a score for how well it explains the observations, then report the most likely vectors in sorted order.
• But there are 3^r possible vectors for r root-cause nodes; how can we search through all of them?
Ferret Algorithm
• IMPORTANT OBSERVATION:
It is very likely that at any point in time only a few root-cause nodes are troubled or down.
• The Ferret algorithm evaluates only the vectors with no more than k root-cause nodes troubled or down, so it processes at most (2r)^k vectors.
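One way to enumerate these candidate vectors (a sketch; names are mine) is to pick up to k root causes and assign each either troubled or down, leaving the rest up:

```python
from itertools import combinations, product

def candidate_vectors(root_causes, k):
    """Yield assignment vectors with at most k root causes not up."""
    for m in range(k + 1):
        for subset in combinations(root_causes, m):
            for states in product(("troubled", "down"), repeat=m):
                vector = {rc: "up" for rc in root_causes}
                vector.update(zip(subset, states))
                yield vector
```

For r = 3 root causes and k = 1 this yields 1 + 3*2 = 7 vectors, far fewer than the 3^3 = 27 possible vectors.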
Ferret algorithm
• The approximation error of the Ferret algorithm becomes vanishingly small for k ≥ 4.
Ferret Algorithm
• OBSERVATION: Since a root cause is assigned to be up in most assignment vectors, evaluating an assignment vector only requires re-evaluating the states of descendants of root-cause nodes that are not up.
• So the Ferret algorithm can be sped up further.
Ferret Algorithm
Score for a given assignment vector:
• Track the history of response times and fit two Gaussian distributions to the data, namely Gaussian_up and Gaussian_troubled.
• If the observed response time is t and the predicted state of the observation node is (p_up, p_troubled, p_down), then the score of this vector is:
p_up * Prob(t | Gaussian_up) + p_troubled * Prob(t | Gaussian_troubled)
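This score can be sketched directly in Python (function names are mine; p_down contributes nothing because a down service returns no response at all):

```python
import math

def gaussian_pdf(t, mean, std):
    """Density of a fitted Gaussian at response time t."""
    z = (t - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def observation_score(t, p_up, p_troubled, up_params, troubled_params):
    """Score of one observation given the node's predicted state.

    up_params / troubled_params are (mean, std) of the Gaussians
    fitted to historically normal and troubled response times.
    """
    return (p_up * gaussian_pdf(t, *up_params)
            + p_troubled * gaussian_pdf(t, *troubled_params))
```

A fast response scores high when the vector predicts the node is up, and low when it predicts the node is troubled, which is exactly how the score rewards vectors that explain the observations.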
Ferret Algorithm
• We use a statistical test to determine if the prediction is sufficiently meaningful to deserve attention.
• Null hypothesis: all root causes are up
• Score(best prediction) – Score(null hypothesis)
The Sherlock System
Step1: Discover Service-Level Dependencies
Step2: Construct the Inference Graph
Step3: Fault Localization using Ferret
Remark: Sherlock requires no changes to routers, applications, or middleware used in the enterprise.
Discovering Service-Level Dependencies
• OBSERVATION: If accessing service B depends on service A, then packets exchanged with A and B are likely to co-occur.
• Dependency probability: conditional probability of accessing service A within the dependency interval, prior to accessing service B.
• Dependency interval: 10 ms.
• Chance co-occurrence: 10ms/I, where I is the mean interval between accesses. We retain only the dependency probabilities that are much greater than the chance of co-occurrence.
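A minimal sketch of this co-occurrence test over timestamped accesses (variable names are mine; the real Sherlock Agent computes this from sniffed packet traces):

```python
import bisect

def dependency_probability(a_times, b_times, interval=0.010):
    """Estimate P(an access to A occurred within `interval` seconds
    before an access to B), from sorted-by-need access timestamps."""
    a_times = sorted(a_times)
    hits = 0
    for t in b_times:
        # Find the latest access to A at or before this access to B.
        i = bisect.bisect_right(a_times, t)
        if i > 0 and t - a_times[i - 1] <= interval:
            hits += 1
    return hits / len(b_times) if b_times else 0.0
```

The resulting probability is then compared against the chance co-occurrence baseline 10ms/I to decide whether the dependency is kept.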
Aggregating Probabilities across Clients
• OBSERVATION: many clients in an enterprise network have similar host, software and network configuration and are likely to have similar dependencies.
• Aggregation allows us to better handle infrequently accessed services and to prune false dependencies.
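A simple way to aggregate per-client estimates (an averaging sketch; the paper's exact weighting may differ):

```python
from collections import defaultdict

def aggregate_dependencies(per_client):
    """Average each client's probability for the same (service A, service B)
    pair across all clients that observed that pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for client_probs in per_client:
        for pair, p in client_probs.items():
            totals[pair] += p
            counts[pair] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}
```

Averaging over many similarly configured clients smooths out spurious co-occurrences seen by any single client.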
Constructing the Inference Graph
A: Create a noisy-max meta-node to represent the service, and create an observation node for each client.
Constructing the Inference Graph
B: Identify the set of services D_S that the clients depend on when accessing S, and recurse.
Constructing the Inference Graph
C: Create a root-cause node to represent the host on which the service runs, and make this root cause a parent of the meta-node.
Constructing the Inference Graph
D: Add network topology information to the Inference Graph: add a noisy-max meta-node to represent each path, and root-cause nodes to represent each router and link.
Constructing the Inference Graph
E: Finally, add AT and AD to the Inference Graph. Give each edge connecting AT/AD to an observation node a weight of 0.001, and give the edges between a router and a path meta-node a probability of 0.999.
Implementation
• Implement Sherlock Agent as a user-level daemon
– Windows XP
– Pcap-based sniffer
Implementation
• Centralized Inference Engine
– Aggregates information easily
– Scalability
• Even for an extremely large network (10^5 agents), the bandwidth requirement is about 10 Mbps.
Implementation
• Computational complexity of fault localization
– Grows linearly with graph size
Evaluation
• Deployed the Sherlock system in our enterprise network
– 40 servers
– 34 routers
– 54 IP links
– 2 LANs
– 3 weeks
• Agents periodically send requests to the web and file servers
– Mimicking user behavior
• Controlled environments: Testbed and simulation
Evaluation – Discovering Service Dependencies
• Evaluate Sherlock's algorithm for discovering service-level dependency graphs
• Two examples:
– Web server
– File server
• In the graphs:
– Arrows point from the server providing a service to the server relying on it.
– Edge weights represent the strength of the dependencies.
Evaluation – Discovering Service Dependencies
• Service-level dependency graphs for visiting the web portal and sales website
– Clients depend on name-lookup servers to access the websites
– Both websites share substantial portions of their back-end dependencies
Evaluation – Discovering Service Dependencies
• Service-level dependency graph for visiting the file server
– It turns out the file server is the root name of a distributed file system
Evaluation – Discovering Service Dependencies
• Summary
– Service-level dependencies vary across services
– Similar dependencies have different strengths
– Dependencies change over time
Evaluation – Discovering Service Dependencies
• Number of samples
– About 4,000 samples are needed for convergence.
– Once converged, service-level dependencies are stable over several days to weeks.
Evaluation – Discovering Service Dependencies
• Number of clients
– Sherlock aggregates samples from multiple clients
– Aggregating across 20 clients greatly reduces false dependencies
• Evaluate Ferret
– Ability to localize faults
– Sensitivity to errors
Evaluation – Localizing Faults
• Testbed
– Three web servers
– SQL server
– Authentication server
• Failed or overloaded:
– Overloaded link: 5% of packets dropped
– Overloaded server: high CPU and disk utilization
Evaluation – Localizing Faults
• Service-level dependencies– Arrows at the bottom level
Evaluation – Inference Graph
Features of Ferret
• Localizes faults – Correlating observations from multiple points
• No threshold setting needed
• Clients' observations of the web server's performance
– Probabilistic correlation algorithm for reasoning
• Produces a list of suspects for the root cause
• Handles multiple simultaneous failures
Fault Localization Capability
• Affected by:
– The number of vantage points
– The number of observations available in the time window
• Requires observations from the same time period that exercise all paths in the Inference Graph.
Fault Localization Capability
• In a time window of 10 seconds
– At least 70 unique clients access every one of the top 20 servers
– The top 20 servers alone provide enough observations to localize faults
Evaluation - Sherlock Field Deployment
• The resulting Inference Graph contains:
– 2,565 nodes
– 358 components that can fail independently
• Experiment:
– Ran the Sherlock system for 120 hours
– Found 1,029 performance problems
Evaluation - Sherlock Field Deployment
• The result
– X-axis: time
– Y-axis: one row per component
Evaluation - Sherlock Field Deployment
• Ferret's computation:
– 87% of the problems were caused by only 16 components
– e.g., Server1, Server2, and the link on R1
Features of Sherlock
• Discovers problems overlooked by traditional threshold-based techniques.
– Requests requiring interaction between the web server and the SQL backend experienced poor performance
• No threshold-based alerts were raised
• Sherlock caught it!
Features of Sherlock
• Discovers problems overlooked by traditional threshold-based techniques.
– No threshold setting can catch this problem while avoiding false alarms
Sherlock vs. prior approaches
• Multi-level inference graph vs. Two-level
• Probabilistic dependencies
• Need a large set of observations where the actual root cause of the problem is known
– Simulation!
Sherlock vs. prior approaches
• Conduct experiments with simulations
– Known topology and Inference Graph
• Randomly set the state of each root cause
• Repeating 1,000 times produced 1,000 sets of observations
– Compare different techniques' ability to identify the correct root cause given the 1,000 observations
• Ferret finds up to 32% more faults than Shrink and SCORE
Sherlock vs. prior approaches
– Compare different techniques' ability
• Shrink: uses a two-level inference graph
• SCORE: uses minimal set cover
– Dependencies either exist or not (no probabilities)
Time to Localize Faults
• Experiment:
– AMD 1.8 GHz, 1.5 GB RAM
– Inference Graph:
• 500,000 nodes
• 2,300 clients
• 70 servers
• Ferret takes 24 minutes to localize an injected fault
Time to Localize Faults
• Running time
– Ferret's running time is always less than 4 ms times the number of nodes in the Inference Graph
• Ferret is easily parallelizable
Impact of Errors in Inference Graph
• Errors are unavoidable when constructing inference graphs
– How sensitive Ferret is to these errors?
• Experiment:
1. Randomly add new parents to a node
2. Randomly swap one of a node's parents with another node
3. Randomly change the weights on edges
4. Randomly add an extra hop to a path or permute its intermediate hops
– The first three types model errors in the dependencies
– The last type models errors in traceroute
Impact of Errors in Inference Graph
• Observations:
– Each point represents the average of 1,000 experiments
– When half of the paths/nodes/weights are perturbed, Ferret still correctly localizes faults in 74.3% of the cases
– Perturbing the paths has the largest impact
Modeling Redundancy Techniques
• Modeling load balancing and redundancy
– Using specialized meta-nodes
• Experiment:
– Create 24 failure scenarios where the root cause is a component connected to a specialized meta-node
– Test Ferret using:
• Inference graphs with specialized meta-nodes
• Inference graphs with noisy-max meta-nodes instead
– Results:
• In the 14 cases where a secondary server or backup path failed:
– No difference between the two kinds of inference graphs
• In the 10 cases where a primary server or path failed:
– With specialized meta-nodes, Ferret identified all 10 correctly
– Without them, Ferret misidentified the root cause in 4 cases
Summary of Evaluation
• Sherlock is able to discover service dependencies in a few hours
• Since an enterprise network may contain a huge number of nodes, an automatic approach is justified
• Effectively identifies performance problems and narrows down the root cause
• Sherlock is robust to noise, compared to other approaches
Discussion – Future Work
• Virtual machines
– Many enterprises run multiple servers on a single piece of hardware via VMs.
– All the algorithms in this paper can treat each VM as a separate host.
Discussion – Future Work
• Software failures
– Software failures are not currently modeled
– Extending our inference graph to these common failure modes is our next step
Contributions
• Main contribution:
– The Sherlock system, which assists IT admins in troubleshooting
• We contribute:
– A multi-level probabilistic inference model
– A technique for automatic construction of the Inference Graph
– An algorithm that uses the Inference Graph to localize the root cause
Thanks~
Any Questions?