
Hansel: Diagnosing Faults in OpenStack

Dhruv Sharma, UC San Diego
Rishabh Poddar, UC Berkeley
Kshiteej Mahajan, U Wisconsin‒Madison
Vijay Mann, IBM Research
Mohan Dhawan, IBM Research

conferences2.sigcomm.org/co-next/2015/img/papers/hansel.pdf

2/12/15 CoNEXT'15

Cloud Management System (CMS)

● Complex distributed systems

– Intelligent orchestration of tasks, like instance creation, deletion, migration, etc.

– Communication via REST and RPC

● Popular CMSes include

– Apache CloudStack

– VMware vSphere

– OpenStack

Example Workflow: Launch VM

[Figure: launch-VM workflow involving User, Horizon, Nova, Keystone, Glance, Neutron, and the VM]

Complexity and non-determinism in task executions often result in subtle problems that are hard to diagnose

Outline

● Overview
● Motivation
● Hansel
● Implementation
● Evaluation
● Conclusion

Inconsistent Errors

● Attach instance to external network

– Gets scheduled for spawning

– Something went wrong!

What went wrong?

● Networking service (Neutron) returns no valid interface bindings

– Neutron RPCs throw an exception

– Neutron REST calls to controller service (Nova) merely list binding status as failed

● Nova incorrectly interprets the REST response

– Has no access to the Neutron RPC exceptions

● Horizon reports part of the error

– Sufficient to mislead even skilled developers and operators

Several such cross-component interactions can result in hard-to-diagnose problems

Our Approach

● Non-intrusive network monitoring of OpenStack messages

[Figure: monitoring RPC/REST traffic between Node A and Node B: extract message context, then stitch related contexts (A, B, A')]

Key Idea: Execution Sequence

● OpenStack messages are transitions that drive the system from one state to another

– REST requests create new inter-component state
  ● Indicates transitions across components

– RPCs and REST responses create updated state
  ● RPCs update state for the same component
  ● REST responses update state for the component initiating the request

[Figure: states A, B, and A' connected by a REST request (A to B), RPCs (at B), and a REST response (B to A')]

Key Issues

(1) Accurate modelling of the state managed at each component based on just the OpenStack network messages

(2) Correct placement of REST and RPC transitions connecting the different component states
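The state-transition view of an execution sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hansel's code: the names `State` and `ExecutionSequence`, the per-component version counter, and the edge labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class State:
    component: str  # e.g. "A" or "B" (hypothetical component label)
    version: int    # bumped each time this component's state is updated


@dataclass
class ExecutionSequence:
    edges: list = field(default_factory=list)   # (source, kind, destination)
    latest: dict = field(default_factory=dict)  # component -> most recent State

    def _new_state(self, component):
        prev = self.latest.get(component)
        state = State(component, 0 if prev is None else prev.version + 1)
        self.latest[component] = state
        return state

    def _current(self, component):
        # Reuse the latest state of a component, creating one if none exists.
        return self.latest.get(component) or self._new_state(component)

    def rest_request(self, src, dst):
        # A REST request creates new inter-component state at the receiver.
        s = self._current(src)
        d = self._new_state(dst)
        self.edges.append((s, "REST_REQUEST", d))
        return d

    def rpc(self, component):
        # An RPC updates state within the same component.
        s = self._current(component)
        d = self._new_state(component)
        self.edges.append((s, "RPC", d))
        return d

    def rest_response(self, responder, initiator):
        # A REST response updates the state of the component that sent the request.
        s = self._current(responder)
        d = self._new_state(initiator)
        self.edges.append((s, "REST_RESPONSE", d))
        return d
```

Calling `rest_request("A", "B")`, then `rpc("B")`, then `rest_response("B", "A")` reproduces the A, B, B', A' chain from the slide, with A' represented as version 1 of component A.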

State Modelling

● OpenStack messages propagate resource identifiers across components

– UUID (e.g., cae70ca0-7d55-11e5-aa58-0002a5d5c51b)

● Create a message context based on resource identifiers and other message metadata

● Component state at a given time is an aggregation of such message contexts
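As a rough sketch of the idea above, the snippet below pulls UUIDs out of a message payload, wraps them with metadata into a message context, and aggregates contexts into a component state. The function names, the dictionary layout, and the sample payloads are assumptions made for illustration; the slides do not specify Hansel's data structures.

```python
import re

# Matches canonical 8-4-4-4-12 hexadecimal UUIDs.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def message_context(payload, metadata=None):
    """Build a context from the resource identifiers found in one message."""
    ids = frozenset(match.lower() for match in UUID_RE.findall(payload))
    return {"ids": ids, "meta": dict(metadata or {})}


def component_state(contexts):
    """Aggregate message contexts into the set of identifiers a component has seen."""
    state = set()
    for ctx in contexts:
        state |= ctx["ids"]
    return state
```

For example, a payload containing `cae70ca0-7d55-11e5-aa58-0002a5d5c51b` yields a context whose `ids` set holds exactly that UUID, and the state of a component is the union of the `ids` of all contexts observed for it.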

[Figure: resource types carrying identifiers: VM, Tenant, Network, Storage]

Transition Placement

● Cluster related RPCs at B using request-id

● RESTREQUEST carries the request-id for ensuing RPCs

● RESTRESPONSE can be paired based on protocol and port

[Figure: states A, B, and A' connected by RESTREQUEST (A to B), RPCs (at B), and RESTRESPONSE (B to A')]

Transition placement reduces to finding states A and A'

$A \rightarrow B:\ \Theta(A_{\mathrm{REST_{REQ}}}) \cup \Theta(A_{\mathrm{RPC}}) \supseteq \Theta(B_{\mathrm{REST_{REQ}}})$

$A' = A \cup \Theta(B_{\mathrm{REST_{RESP}}})$
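The two relations above can be read as set operations. The sketch below assumes that Θ(·) denotes the set of identifiers extracted from the given messages, which the slides suggest but do not state outright; the function names and the `candidates` mapping are hypothetical.

```python
def find_parent_states(candidates, rest_request_ids):
    """Return the candidate states whose identifiers cover the REST request.

    `candidates` maps a state name to the union of identifiers seen in that
    state's REST requests and RPCs (an assumed representation).
    """
    wanted = set(rest_request_ids)
    return [name for name, ids in candidates.items() if set(ids) >= wanted]


def updated_state(parent_ids, rest_response_ids):
    """Compute A' as the parent state's identifiers plus those in the REST response."""
    return set(parent_ids) | set(rest_response_ids)
```

More than one candidate can satisfy the containment check, in which case the parent state is ambiguous and further heuristics would be needed to narrow it down.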

Fault Diagnosis

● Domain-specific checks

– Lightweight regular expressions to detect errors in OpenStack REST and RPC messages

● On detection of a fault

– Mark corresponding error state in execution sequence

– Backtrack to determine the sequence of operations leading to the fault
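A minimal sketch of these two steps follows. The error pattern, the sample messages, and the single-parent `parents` mapping are illustrative assumptions; the slides do not list the actual expressions or the graph representation Hansel uses.

```python
import re

# Hypothetical error pattern; real checks would be tuned to OpenStack messages.
ERROR_RE = re.compile(r"error|exception|failed", re.IGNORECASE)


def is_fault(message):
    """Flag a message whose body matches the error pattern."""
    return ERROR_RE.search(message) is not None


def backtrack(parents, error_state):
    """Walk parent links from the error state back to the start of the sequence.

    `parents` maps each state to the state that preceded it. The result lists
    the states in execution order, ending at the error state.
    """
    path = [error_state]
    seen = {error_state}
    while path[-1] in parents:
        prev = parents[path[-1]]
        if prev in seen:  # guard against malformed, cyclic input
            break
        seen.add(prev)
        path.append(prev)
    path.reverse()
    return path
```

With `parents = {"A'": "B'", "B'": "B", "B": "A"}`, backtracking from `"A'"` returns the chain A, B, B', A', which is the sequence of operations an operator would inspect.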

Architecture

[Figure: architecture]

– OpenStack nodes: RPC/REST traffic

– Network monitoring agent: extracts message context

– Analyzer service: event reception, temporal ordering, transaction stitching

Implementation

● Prototype of Hansel for OpenStack Juno

– Network monitoring via Bro/Broccoli
  ● Intercept OpenStack REST and RPC messages
  ● RabbitMQ analyzer for Bro (~60 LOC C++)

– Multi-threaded analyzer service
  ● Implements event reception, temporal ordering, and transaction stitching (~800 LOC Python)

– Several optimizations and heuristics to improve precision
  ● State space conflation for RPCs
  ● Purge buckets for stale execution sequences
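The slides do not say how the analyzer service orders events in time. One common way to do it, shown here purely as an assumption, is to buffer events in a min-heap and release them only once a hold window has passed, so that late arrivals from other monitoring agents can still be slotted in. The class name `TemporalOrderer` and the `hold` parameter are hypothetical.

```python
import heapq
import itertools


class TemporalOrderer:
    """Buffer events and release them in timestamp order after a hold window."""

    def __init__(self, hold=0.5):
        self.hold = hold               # seconds to wait for late events
        self._heap = []                # entries: (timestamp, tie_breaker, event)
        self._tie = itertools.count()  # keeps ordering stable for equal timestamps

    def push(self, timestamp, event):
        heapq.heappush(self._heap, (timestamp, next(self._tie), event))

    def pop_ready(self, now):
        """Return buffered events older than `now - hold`, oldest first."""
        ready = []
        while self._heap and self._heap[0][0] <= now - self.hold:
            timestamp, _, event = heapq.heappop(self._heap)
            ready.append((timestamp, event))
        return ready
```

A larger `hold` tolerates more reordering between agents at the cost of added diagnosis latency.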

Experimental Setup

● Physical setup of a three-tiered datacenter topology with 14 switches, 7 servers, and 3 compute nodes

● Measure Hansel's accuracy, precision and performance

● Tempest integration test suite for OpenStack

– Provides ~1K realistic scenarios involving several nodes

– 709 tests work for our setup

Category    Tests  Txns (K)  RPC (K)  REST (K)  Error  Drop  States (K)
Image         100       3.7     28.0      16.4     50     1         2.0
Management    218       5.9     52.3      26.6    125     3         4.8
Network       109       2.2     15.5       9.0     45     0         1.2
Storage        58       1.0      3.5       3.9     25     0         0.4
VM             24       8.2     87.7      37.0    124   173         5.8
Total         709      21.0    187.0      92.7    369   177        14.2

(RPC, REST, Error, and Drop are event counts.)

Accuracy

● Applied Hansel to several test cases

– Detected root cause in each case

● Delete VM image while saving a snapshot

Precision

● η = (N-n)/(N-1)

– N = Total possible parent nodes

– n = Hansel identified parent nodes

Hansel has high precision (>98%)
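The precision metric η can be computed directly from its definition. The helper below is a small sketch; the function name and the input validation are assumptions added for illustration.

```python
def precision(total_parents, identified_parents):
    """Compute eta = (N - n) / (N - 1).

    N is the total number of possible parent nodes and n is the number of
    parent nodes identified. eta is 1.0 when a single parent is identified
    and 0.0 when every possible parent is.
    """
    N, n = total_parents, identified_parents
    if N < 2:
        raise ValueError("eta is undefined when N < 2")
    if not 1 <= n <= N:
        raise ValueError("n must lie between 1 and N")
    return (N - n) / (N - 1)
```

For instance, with N = 101 possible parents and n = 3 identified, η = 98/100 = 0.98.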

Effectiveness of Optimizations (I)

● RPC state space conflation

Hansel achieves an average reduction of ~61%

Effectiveness of Optimizations (II)

● Impact of purging stale execution sequences

No observable impact on precision

Performance Overheads

● Communication pipeline

Matches the send rate up to 1.6K

Broccoli impacts event processing rate

Conclusion

● Debugging OpenStack faults is hard

● We present Hansel, which

– Leverages non-intrusive network monitoring to expedite fault diagnosis in OpenStack operations

– Examines relevant network communication to mine unique identifiers

– Stitches together related communication to construct a stateful trail of control flow amongst the component nodes

● Hansel is fast, accurate, and precise even under stress

Thank You.

Contact: [email protected]
