automated bootstrapping of a fault-resilient in-band...

44
Automated Bootstrapping of A Fault-Resilient In-Band Control Plane Ermin Sakic , Amaury Van Bemten, Mirza Avdic, Wolfgang Kellerer Technical University Munich & Siemens Germany ACM SOSR 2020 San Jose, March 3, 2020

Upload: others

Post on 18-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Automated Bootstrapping of A

Fault-Resilient In-Band Control Plane

Ermin Sakic, Amaury Van Bemten,

Mirza Avdic, Wolfgang Kellerer

Technical University Munich & Siemens Germany

ACM SOSR 2020

San Jose, March 3, 2020

Page 2: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

INTRODUCTION

Page 3: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Strict requirements:

o QoS: Sub-ms hard real-time E2E delays

o Dependability: Control & data plane HA & reliability

o Topology dynamics: factory cell / work-piece

(de)-attachment

TSN group (802.1) standardizes industrial CP & DP:

o E.g., TAS (Qbv), Frame Pre-emption (Qbu),

FRER (CB), Policing (Qci) etc.

Industrial Networks Overview

Page 4: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Strict requirements:

o QoS: Sub-ms hard real-time E2E delays

o Dependability: Control & data plane HA & reliability

o Topology dynamics: factory cell / work-piece

(de)-attachment

TSN group (802.1) standardizes industrial CP & DP:

o E.g., TAS (Qbv), Frame Pre-emption (Qbu),

FRER (CB), Policing (Qci) etc.

o Centralized (CNC) and distributed stream reservation

TSN requires a highly-available CNC w/ in-band,

dynamically extensible CP & DP

Industrial Networks Overview

Page 5: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Industrial Network Topologies

Page 6: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Industrial Network Topologies

VirtuWind – Virtual and programmable industrial network prototype deployed in operational wind park - https://5g-ppp.eu/virtuwind/

Page 7: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Control Plane Design

Out-of-BandIn-Band

Page 8: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Control Plane Design

In-BandGoal of Bootstrapping:

Automated establishment of a functional and resilient

In-Band SDN control plane

Required:

Initial C2S and C2C connections

Control plane fault tolerance

Full topology available (no blocked ports!)

Network extensions

Compliant with current implementations

Constraints:

Switches know nothing about the controllers

Controllers know whitelisted IP addresses of

remote controllers (e.g., standardized)

Switches and controllers exchange PKI certificates

Page 9: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Control Plane Design

High-level steps:

1. Controllers distribute IP addresses to

switches from a common pool

2. Controllers provides each switch

with controller lists (e.g., OF)

3. Controllers establish control

channels to each switch (e.g., OF)

Page 10: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Resilience Requirements

CP: Must tolerate F out of 2F+1 Fail-Stop controller failures

DP: Must tolerate k element failures k+1 fully or maximally disjoint paths

Page 11: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Bootstrapping Co-Dependency

DP requires appropriate table rules

Rule configuration requires C2C

In-Band C2C requires DP connectivity

Break bootstrapping procedure into sub-phases

Fully Bootstrapped

Data Plane

Bootstrapped

Controllers

Part. Bootstrapped

Data Plane

Flow

Configurations

Page 12: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Design Overview

Contribution:Two automated bootstrapping schemes for a reliable multi-

controller in-band control plane

Hybrid Switch Approach (HSW): Assumes (R)STP

Hop-By-Hop Approach (HHC): No (R)STP

Page 13: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Why regard (R)STP?

+ Beneficial for effortless initial C2C connectivity

- Dimensioning the (R)STP-disable timer non-trivial

Delays in bootstrapping convergence

- Added complexity in the data plane:

Prone to additional failure vectors (YMMV)

Page 14: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

DESIGN OF

THE TWO SCHEMES

Page 15: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

System Initialization

HSW - (R)STP enabled:

standalone mode Heavy use of NORMAL port

in-band mode enabled

HHC - (R)STP unavailable:

secure mode

in-band mode disabled „generic“ OF rules

Page 16: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

System Initialization

HSW - (R)STP enabled:

standalone mode Heavy use of NORMAL port

in-band mode enabled

HHC - (R)STP unavailable:

secure mode

in-band mode disabled „generic“ OF rules

HHC: How to fight initial broadcast storms without (R)STP?

Police problematic C2C traffic

(ARP, TCP SYN, TCP SYN ACK)

Page 17: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HSW Phases 0 and 1

Page 18: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HHC Phases 0 and 1

Page 19: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HSW

(with (R)STP)HHC

(no (R)STP)

Output: Phases 0 and 1

Page 20: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Phase 2: Resilience Embedding

HSW (with (R)STP):

Step 2a: - Establish OF sessions FCFS, install initial rules, disable in-band rules

Step 2b: - Disable R(STP)

- Install resilient flow rules

HHC (no (R)STP):

Step 2a: - Establish OF sessions Hop-By-Hop, install tree flow rules

Step 2b: - Install resilient flow rules whenever possible

Page 21: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HSW Phase 2a

Page 22: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HSW Phase 2a

Page 23: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HSW Phase 2b

Page 24: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HSW Phase 2b

Page 25: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HHC Phase 2a

Page 26: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HHC Phase 2a

Page 27: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HHC Phase 2b

Page 28: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HHC Phase 2b

Page 29: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Phase 2: Outcome both schemes

k+1 max. disjoint paths

for C2C pairs

k+1 max. disjoint paths

for C2S pairs (here only S4)

Page 30: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Dynamic network extensions

Allow new traffic to reach the leader via tree HSW: Prim’s algorithm

HHC: Custom Hop-By-Hop Algorithm

Special rule: in_port=inactive port, udp, udp_src=68, actions=controller Extend tree by parsing DHCP DISCOVERY message

Page 31: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Data Plane Failures

Proactively compute alternative trees

Embed an alternative tree in case a DP element fails

Page 32: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

EVALUATION

Page 33: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Evaluation - KPIs

Global Bootstrapping Convergence Time (GBCT)

Network Extension Time (TEXT)

Flow Table Occupancy (FTO)

TOPOLOGY TYPES

TOPOLOGY SIZES

CONTROLLER PLACEMENTS

NUMBER OF CONTROLLERS

GBCT

TEXT

FTO

Page 34: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Global Bootstrapping Convergence Time

Single Controller

* normalized by minimum mean ~13.5s

Page 35: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Global Bootstrapping Convergence Time

Multiple Controllers

* normalized by minimum mean ~33.9s

Page 36: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Network Extension Time

Single Controller

* normalized by minimum mean ~6.5s

Page 37: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Network Extension Time

Multiple Controllers

* normalized by minimum mean ~33.5s

Page 38: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Flow Table Occupation

Ratios of per-switch FTOs, normalized

respective to the FTO in 1-controller case

Page 39: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

SUMMARY

Page 40: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

HSW - (R)STP

+ Straightforward; easier to implement

- Dependency on legacy protocols (and implementation)

- Worse performance due to (R)STP Timer

HHC - No (R)STP

+ Less legacy protocol dependencies

+ Faster on average

- Slightly more complex implementation

Summary - Pros and Cons

Page 41: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Artifacts and Future Updates

Source code for both approaches and Docker-based

OpenFlow emulator available!

https://github.com/ermin-sakic/sdn-automated-bootstrapping

Page 42: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Artifacts and Future Updates

Source code for both approaches and Docker-based

OpenFlow emulator available!

https://github.com/ermin-sakic/sdn-automated-bootstrapping

Page 43: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Artifacts and Future Updates

Potential optimizations:

Automated rule compression for lower FTO

Tree merging instead of swapping

Support for concurrent multi-controller bootstrapping?

(RAFT membership issues?)

Source code for both approaches and Docker-based

OpenFlow emulator available!

https://github.com/ermin-sakic/sdn-automated-bootstrapping

Page 44: Automated Bootstrapping of A Fault-Resilient In-Band ...conferences.sigcomm.org/sosr/2020/slides/Automated Bootstrappin… · Automated Bootstrapping of A Fault-Resilient In-Band

Selected References

Marco Canini, Iosif Salem, Liron Schiff, Elad M Schiller, and Stefan Schmid. 2017. A self-organizing distributed and in-band SDN control plane. In 2017

IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2656–2657.

Marco Canini, Iosif Salem, Liron Schiff, Elad Michael Schiller, and Stefan Schmid. 2018. Renaissance: A self-stabilizing distributed SDN control plane. In

2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). IEEE, 233–243.

Josef Dorr. 2018. IEC/IEEE P60802 JWG TSN Industrial Profile: Use Cases Status Update 2018-05-14. IEC/IEEE. https://1.ieee802.org/tsn/iec-ieee-

60802/

Peter Heise, Fabien Geyer, and Roman Obermaisser. 2017. Self-configuring deterministic network with in-band configuration channel. In Software

Defined Systems (SDS), 2017 Fourth International Conference on. IEEE, 162–167.

Liron Schiff, Stefan Schmid, and Marco Canini. 2016. Ground control to major faults: Towards a fault tolerant and adaptive SDN control network. In

Dependable Systems and Networks Workshop, 2016 46th Annual IEEE/IFIP International Conference on. IEEE, 90–96.

Liron Schiff, Stefan Schmid, and Marco Canini. 2015. Medieval: Towards A Self-Stabilizing, Plug & Play, In-Band SDN Control Network. In ACM

Sigcomm Symposium on SDN Research (SOSR).

Sachin Sharma, Dimitri Staessens, Didier Colle, Mario Pickavet, and Piet Demeester. 2013. A demonstration of automatic bootstrapping of resilient

OpenFlow networks. In 13th IFIP/IEEE International Symposium on Integrated Network Management (IM). IEEE, 1066–1067.

Sachin Sharma, Dimitri Staessens, Didier Colle, Mario Pickavet, and Piet Demeester. 2013. Fast failure recovery for in-band OpenFlow networks. In

Design of Reliable Communication Networks (DRCN) 2013 9th International Conference on the. IEEE, 52–59.