Hansel: Diagnosing Faults in OpenStack
Dhruv Sharma, UC San Diego
Rishabh Poddar, UC Berkeley
Kshiteej Mahajan, U Wisconsin-Madison
Vijay Mann, IBM Research
Mohan Dhawan, IBM Research
2/12/15 CoNEXT'15
Cloud Management System (CMS)
● Complex distributed systems
– Intelligent orchestration of tasks, like instance creation, deletion, migration, etc.
– Communication via REST and RPC
● Popular CMSes include
– Apache CloudStack
– VMware vSphere
– OpenStack
Example Workflow: Launch VM
[Figure: components involved in launching a VM: User, Horizon, Nova, Keystone, Glance, Neutron, VM]
Complexity and non-determinism in task executions often result in subtle problems that are hard to diagnose
Outline
● Overview
● Motivation
● Hansel
● Implementation
● Evaluation
● Conclusion
Inconsistent Errors
● Attach instance to external network
– Gets scheduled for spawning
– Something went wrong!
What went wrong?
● Networking service (Neutron) returns no valid interface bindings
– Neutron RPCs throw an exception
– Neutron REST calls to controller service (Nova) merely list binding status as failed
● Nova incorrectly interprets the REST response
– Has no access to the Neutron RPC exceptions
● Horizon reports part of the error
– Sufficient to mislead even skilled developers and operators
Several such cross-component interactions can result in hard-to-diagnose problems
Our Approach
● Non-intrusive network monitoring of OpenStack messages
– Extract message context
– Stitch related contexts
[Figure: Node A and Node B exchange RPC/REST messages; states A, B, A']
Key Idea: Execution Sequence
● OpenStack messages are transitions that drive the system from one state to another
– REST requests create new inter-component state
  ● Indicates transitions across components
– RPCs and REST responses create updated state
  ● RPCs update state for the same component
  ● REST responses update state for the component initiating the request
[Figure: A → B via REST request; RPCs at B; B → A' via REST response]
Key Issues
(1) Accurate modelling of the state managed at each component based on just the OpenStack network messages
(2) Correct placement of REST and RPC transitions connecting the different component states
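The three transition rules above can be sketched as a small state machine. This is a minimal illustration rather than Hansel's implementation: the `State` class, the `step` function, and the message-kind strings are hypothetical names, and a state is reduced to a component label plus an update counter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    component: str  # component that owns the state, e.g. "A" or "B"
    version: int    # how many times this component's state was updated

    def label(self) -> str:
        # Render updates as primes: A, A', A'', ...
        return self.component + "'" * self.version


def step(sequence, kind, src, dst):
    """Return the execution sequence extended by the state one message creates."""
    if kind == "REST_REQUEST":
        owner = dst  # new inter-component state at the receiving component
    elif kind == "RPC":
        owner = src  # RPCs update state for the same component
    elif kind == "REST_RESPONSE":
        owner = dst  # updated state for the component that initiated the request
    else:
        raise ValueError(f"unknown message kind: {kind}")
    seen = sum(1 for s in sequence if s.component == owner)
    return sequence + [State(owner, seen)]


sequence = [State("A", 0)]
sequence = step(sequence, "REST_REQUEST", "A", "B")
sequence = step(sequence, "RPC", "B", "B")
sequence = step(sequence, "REST_RESPONSE", "B", "A")
labels = [s.label() for s in sequence]
print(" -> ".join(labels))  # A -> B -> B' -> A'
```

Here the RPC at B is shown as a separate state B'; treating it as an update that stays at B gives the shorter A, B, A' picture.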
State Modelling
● OpenStack messages propagate resource identifiers across components
– UUID (e.g., cae70ca0-7d55-11e5-aa58-0002a5d5c51b)
● Create a message context based on resource identifiers and other message metadata
● Component state at a given time is an aggregation of such message contexts
[Figure: resource types: VM, Tenant, Network, Storage]
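As a rough illustration of the state modelling described above, the sketch below mines UUIDs from a message payload, combines them with metadata into a message context, and aggregates contexts into a component state. The JSON payload, the metadata keys, and the function names are assumptions made for the example; only the UUID format comes from the slide.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def message_context(payload, metadata):
    """Build a context: the resource identifiers in a message plus its metadata."""
    ids = frozenset(match.lower() for match in UUID_RE.findall(payload))
    return {"ids": ids, **metadata}


def component_state(contexts):
    """Aggregate message contexts into the identifier set known at a component."""
    state = set()
    for ctx in contexts:
        state |= ctx["ids"]
    return state


# Hypothetical payload and metadata, for illustration only.
payload = '{"server": {"id": "cae70ca0-7d55-11e5-aa58-0002a5d5c51b"}}'
ctx = message_context(payload, {"kind": "REST_RESPONSE", "src": "nova"})
state = component_state([ctx])
print(sorted(state))
```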
Transition Placement
● Cluster related RPCs at B using request-id
● REST request carries the request-id for ensuing RPCs
● REST response can be paired based on protocol and port
[Figure: A → B via REST request; RPCs at B; B → A' via REST response]
Transition placement reduces to finding states A and A'
$A \to B:\ \Theta(A_{REST_{REQ}}) \cup \Theta(A_{RPC}) \supseteq \Theta(B_{REST_{REQ}})$
$A' = A \cup \Theta(B_{REST_{RESP}})$
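The two placement conditions can be read as plain set operations. The sketch below assumes that Θ(·) denotes the set of resource identifiers in a message context, which the slides do not define explicitly; the function names and the short placeholder identifiers are hypothetical.

```python
def find_parents(candidates, b_rest_req_ids):
    """Return names of states A with Θ(A_REST_REQ) ∪ Θ(A_RPC) ⊇ Θ(B_REST_REQ)."""
    parents = []
    for name, (rest_req_ids, rpc_ids) in candidates.items():
        if (rest_req_ids | rpc_ids) >= b_rest_req_ids:
            parents.append(name)
    return parents


def updated_state(a_ids, b_rest_resp_ids):
    """Compute A' = A ∪ Θ(B_REST_RESP)."""
    return a_ids | b_rest_resp_ids


# Each candidate maps to (identifiers from its REST request, identifiers from its RPCs).
candidates = {
    "A1": ({"vm-1"}, {"net-7"}),
    "A2": ({"vm-2"}, set()),
}
parents = find_parents(candidates, {"vm-1", "net-7"})
a_prime = updated_state({"vm-1", "net-7"}, {"port-3"})
print(parents, sorted(a_prime))
```

Only A1 covers every identifier carried by B's REST request, so it is the sole candidate parent in this example.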
Fault Diagnosis
● Domain specific checks
– Lightweight regular expressions to detect errors in OpenStack REST and RPC messages
● On detection of a fault
– Mark corresponding error state in execution sequence
– Backtrack to determine the sequence of operations leading to the fault
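A minimal sketch of the two steps above, under stated assumptions: the error patterns are illustrative stand-ins for the domain-specific checks, and the execution sequence is simplified to a hypothetical child-to-parent mapping between state labels.

```python
import re

# Hypothetical patterns; the actual checks are specific to OpenStack messages.
ERROR_PATTERNS = [
    re.compile(r'"status":\s*"(error|failed)"', re.IGNORECASE),
    re.compile(r"Traceback \(most recent call last\)"),
]


def is_fault(payload):
    """Return True if any error pattern matches the message payload."""
    return any(pattern.search(payload) for pattern in ERROR_PATTERNS)


def backtrack(parent_of, state):
    """Follow parent links from an error state back to the start of the sequence."""
    trail = [state]
    while state in parent_of:
        state = parent_of[state]
        trail.append(state)
    return list(reversed(trail))


# Messages observed per state, and the child -> parent links between states.
events = [
    ("A", '{"action": "attach"}'),
    ("B", '{"binding": {"status": "pending"}}'),
    ("B'", '{"binding": {"status": "failed"}}'),
]
parent_of = {"B": "A", "B'": "B"}

error_states = [state for state, payload in events if is_fault(payload)]
trail = backtrack(parent_of, error_states[0])
print(error_states, trail)
```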
Architecture
[Figure: N/W monitoring agents at the OpenStack nodes capture RPC/REST messages and extract message context; the analyzer service performs event reception, temporal ordering, and transaction stitching]
Implementation
● Prototype of Hansel for OpenStack Juno
– Network monitoring via Bro/Broccoli
  ● Intercept OpenStack REST and RPC messages
  ● RabbitMQ analyzer for Bro (~60 LOC C++)
– Multi-threaded analyzer service
  ● Implements event reception, temporal ordering, and transaction stitching (~800 LOC Python)
– Several optimizations and heuristics to improve precision
  ● State space conflation for RPCs
  ● Purge buckets for stale execution sequences
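The slides do not say how the analyzer service orders events in time. One plausible approach, shown here purely as an assumption, is to buffer events arriving from different monitoring agents in a min-heap and release them in timestamp order once they are older than a hold-back window. The `TemporalOrderer` class and its parameters are hypothetical.

```python
import heapq


class TemporalOrderer:
    """Buffer out-of-order events and release them in timestamp order."""

    def __init__(self, window):
        self.window = window  # hold-back time in seconds (assumed parameter)
        self._heap = []
        self._count = 0       # tie-breaker so equal timestamps keep arrival order

    def push(self, timestamp, event):
        heapq.heappush(self._heap, (timestamp, self._count, event))
        self._count += 1

    def pop_ready(self, now):
        """Return events at least `window` seconds old, oldest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now - self.window:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


orderer = TemporalOrderer(window=0.5)
orderer.push(1.2, "rpc-from-node-b")
orderer.push(1.0, "rest-request-from-node-a")
orderer.push(2.0, "rest-response-from-node-b")
released = orderer.pop_ready(now=2.0)
print(released)
```

The REST request is released before the RPC although it arrived later, and the newest event stays buffered until the window has passed.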
Experimental Setup
● Physical setup of a three-tiered datacenter topology with 14 switches, 7 servers, and 3 compute nodes
● Measure Hansel's accuracy, precision and performance
● Tempest integration test suite for OpenStack
– Provides ~1K realistic scenarios involving several nodes
– 709 tests work for our setup
Category     Tests  Txns (K)  Events                          States (K)
                              RPC (K)  REST (K)  Error  Drop
Image        100    3.7       28.0     16.4      50     1     2.0
Management   218    5.9       52.3     26.6      125    3     4.8
Network      109    2.2       15.5     9.0       45     0     1.2
Storage      58     1.0       3.5      3.9       25     0     0.4
VM           24     8.2       87.7     37.0      124    173   5.8
Total        709    21.0      187.0    92.7      369    177   14.2
Accuracy
● Applied Hansel to several test cases
– Detected root cause in each case
● Delete VM image while saving a snapshot
Precision
● η = (N-n)/(N-1)
– N = Total possible parent nodes
– n = Hansel identified parent nodes
Hansel has high precision >98%
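The precision metric can be checked with a few lines of arithmetic. The values of N and n below are made up for illustration and are not taken from the evaluation.

```python
def precision(total_parents, identified_parents):
    """Compute eta = (N - n) / (N - 1)."""
    if total_parents < 2:
        raise ValueError("N must be at least 2 for the ratio to be defined")
    return (total_parents - identified_parents) / (total_parents - 1)


eta_best = precision(50, 1)    # exactly one parent identified
eta_two = precision(51, 2)     # two candidates out of 51
eta_worst = precision(50, 50)  # every possible parent identified
print(eta_best, eta_two, eta_worst)
```

Identifying a single parent gives η = 1, while failing to narrow down the candidates at all gives η = 0.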
Effectiveness of Optimizations (I)
● RPC state space conflation
Hansel achieves an average reduction of ~61%
Effectiveness of Optimizations (II)
● Impact of purging stale execution sequences
No observable impact on precision
Performance Overheads
● Communication pipeline
Matches send rate till 1.6K
Broccoli impacts event processing rate
Conclusion
● Debugging OpenStack faults is hard
● We present Hansel, which
– Leverages non-intrusive network monitoring to expedite fault diagnosis in OpenStack operations
– Examines relevant network communication to mine unique identifiers
– Stitches together related communication to construct a stateful trail of control flow amongst the component nodes
● Hansel is fast, accurate, and precise even under stress