Reducing Risk by Managing Software Related Failures in
Networked Control Systems
Girish Baliga, Google, Inc
Scott Graham, Air Force Inst. of Technology (AFIT)
Carl A. Gunter, Dept. of Computer Science, UIUC
P. R. Kumar, Dept. of ECE and CSL, UIUC
Information Technology Convergence Lab
[Figure: lab setup labeled Vision Sensors, Automatic Control, Ad Hoc Network, Planning and Scheduling]
Networked Control Systems
[Figure: Sensors 1 and 2, Controllers 1 and 2, Actuators 1 and 2, Plants 1 and 2, Filter 1, and a Supervisor connected through a Network]
Software related failures
Programming errors
– Simple errors such as an incorrect storage size can be catastrophic
– E.g. the Ariane 5 failure was due to an overflow in a 16-bit integer variable
Passive failures
– Software, node, and link failures can cut off sub-systems
– E.g. a car controller failure can cause a car to collide with other cars
Active failures
– Faulty software can interfere with other sub-systems
– E.g. car controller or sensor errors can cause car collisions
Byzantine failures
– Malicious agents can actively interfere with system operation
– E.g. rogue cars can try to block intersections and collide with other cars
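The storage-size error listed under programming errors above can be illustrated with a minimal sketch. It is not the Ariane 5 flight code; the function names and the sample value are hypothetical. It shows how an unchecked conversion to a 16-bit signed integer wraps around silently, and how a range check turns the same mistake into an error that the surrounding component can handle.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Truncate to an integer and keep only the low 16 bits (silent wraparound)."""
    return ctypes.c_int16(int(value)).value


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


if __name__ == "__main__":
    print(to_int16_unchecked(40000.0))  # wraps to a negative number
    try:
        to_int16_checked(40000.0)
    except OverflowError as err:
        print("rejected:", err)
```

The checked variant does not remove the error; it only makes it visible at the point where it occurs, so that it can be contained.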
Preventing software related failures
Robust control laws
– Control laws can be designed to tolerate software failures
– But errors could exist in the control law implementations
Software verification using formal methods
– Formal methods could be used to verify software implementations
– But failures could occur in systems software, libraries, hardware, or links
– Also, software verification is very hard for large systems
Presence of software errors must be a basic assumption in system design
Component based design
[Figure: two diagrams labeled "Control system design" and "Component based design", with blocks Controller, Plant, Supervisor, Sensor, and Actuator]
Component based software design isolates programming errors
Virtual Collocation
Etherware (Baliga & Kumar '03)
Etherware manages all software components in a networked control system
Etherware
– Location independence
– Semantic addressing of components
– System startup and upgrade during execution
– Time translation
– Automatic migration of components for performance
Etherware manages software failures
– Quick and efficient component restarts
– Maintain interconnections across failures
[Figure: layered stack (Physical Layer, MAC, Network Layer, Transport Layer, Application Layer) with Service 2, Service 3, and Timing, and components Discrete Event Scheduler, Kalman filter, Trajectory Planner, Car controller, Model Predictive Controller, Set Point Generation, Image Processing, and Control Law Optimization]
[Figure: a Message Stream connecting a Sensor and a Controller]
Message streams connect software components
- Message streams are set up and managed automatically by Etherware
- Message streams are persisted across component restarts
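A minimal sketch of a message stream that survives a component restart follows. It is not Etherware's API: the class `MessageStream` and its methods are hypothetical, and queuing messages while the receiver is down is only one assumed reading of "persisted".

```python
from collections import deque
from typing import Any, Callable, Optional


class MessageStream:
    """Hypothetical stream that outlives the component consuming it."""

    def __init__(self) -> None:
        self._pending: deque = deque()
        self._receiver: Optional[Callable[[Any], None]] = None

    def attach(self, receiver: Callable[[Any], None]) -> None:
        """Connect a (re)started component and deliver anything queued meanwhile."""
        self._receiver = receiver
        while self._pending:
            receiver(self._pending.popleft())

    def detach(self) -> None:
        """Called when the receiving component fails or is being restarted."""
        self._receiver = None

    def send(self, message: Any) -> None:
        """Deliver immediately if a receiver is attached, otherwise queue."""
        if self._receiver is None:
            self._pending.append(message)
        else:
            self._receiver(message)


if __name__ == "__main__":
    stream = MessageStream()
    first_run, second_run = [], []
    stream.attach(first_run.append)
    stream.send("pose-1")
    stream.detach()                     # controller crashes
    stream.send("pose-2")               # sensor keeps sending
    stream.attach(second_run.append)    # controller restarts
    print(first_run, second_run)
```

The sender never needs to know that the receiver was restarted, which is the sense in which the interconnection is maintained across the failure.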
Etherware mechanisms for managing software related failures
[Figure: a Kalman Filter component with a Filter]
Filters intercept messages
- Filters can be added to components and message streams
- Filters can be used to manage component interactions
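The filter mechanism described above can be sketched as a chain of functions that see each message before the component does. The names `Component`, `add_filter`, and `clamp_speed` are hypothetical and not taken from Etherware; the sketch only assumes that a filter may pass a message through, rewrite it, or drop it.

```python
from typing import Callable, List, Optional

Message = dict
Filter = Callable[[Message], Optional[Message]]


class Component:
    """Hypothetical component whose incoming messages pass through filters."""

    def __init__(self, handler: Callable[[Message], None]) -> None:
        self._handler = handler
        self._filters: List[Filter] = []

    def add_filter(self, message_filter: Filter) -> None:
        self._filters.append(message_filter)

    def deliver(self, message: Message) -> bool:
        """Run the filters in order; return False if one of them drops the message."""
        for message_filter in self._filters:
            message = message_filter(message)
            if message is None:
                return False
        self._handler(message)
        return True


def clamp_speed(message: Message) -> Optional[Message]:
    """Example filter: drop malformed messages, cap speed commands at 1.0."""
    if "speed" not in message:
        return None
    return {**message, "speed": min(message["speed"], 1.0)}


if __name__ == "__main__":
    received = []
    actuator = Component(received.append)
    actuator.add_filter(clamp_speed)
    actuator.deliver({"speed": 3.5})
    actuator.deliver({"noise": True})
    print(received)
```

Because the filter sits outside the component, it can be added or changed without touching the component's own code.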
Local temporal autonomy
[Figure: Vision Sensors 1 and 2, Vision Server, Supervisor, Controller 1, and Actuator 1, with state estimators and a control buffer]
Local temporal autonomy reduces component dependencies to tolerate passive failures
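One way to picture local temporal autonomy is a control buffer at the actuator. The slides name the control buffer but do not describe its behavior, so the following is an assumed interpretation with hypothetical names: the controller sends a short plan of future controls, and the actuator keeps working through that plan if updates stop arriving, falling back to a safe default when the plan runs out.

```python
from collections import deque
from typing import Iterable


class ControlBuffer:
    """Hypothetical buffer that lets an actuator run without fresh controller input."""

    def __init__(self, safe_control: float = 0.0) -> None:
        self._queue: deque = deque()
        self._safe_control = safe_control

    def update(self, planned_controls: Iterable[float]) -> None:
        """Replace the buffered plan with a fresh one from the controller."""
        self._queue = deque(planned_controls)

    def next_control(self) -> float:
        """Return the control for this tick, or the safe default if the plan ran out."""
        return self._queue.popleft() if self._queue else self._safe_control


if __name__ == "__main__":
    buffer = ControlBuffer(safe_control=0.0)
    buffer.update([1.0, 0.8, 0.5])        # last plan before the controller fails
    print([buffer.next_control() for _ in range(5)])
```

The length of the plan sets how long the actuator can tolerate a silent controller or a broken link before it has to stop.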
Collation
[Figure: Vision Sensors 1 and 2, Vision Server, Supervisor, Controller, and Actuator, with a CA Supervisor and a CA Filter]
Collation of multiple independent inputs safeguards against active failures
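The slides do not say how inputs are collated. As a sketch under that caveat, the example below uses a median, a swapped-in technique chosen for illustration, to combine independent readings of the same quantity and to flag any source that disagrees with the rest. The source names and the threshold are hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple


def collate(readings: Dict[str, float], max_spread: float) -> Tuple[float, List[str]]:
    """Combine independent readings and list the sources that look faulty.

    Returns the median reading and the names of sources whose value differs
    from the median by more than max_spread.
    """
    agreed = median(readings.values())
    suspects = sorted(
        name for name, value in readings.items() if abs(value - agreed) > max_spread
    )
    return agreed, suspects


if __name__ == "__main__":
    position, suspects = collate(
        {"vision_1": 10.1, "vision_2": 9.9, "estimator": 42.0}, max_spread=1.0
    )
    print(position, suspects)
```

With three independent sources, a single faulty one cannot move the median far, which is what limits the damage an active failure can do.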
Security overrides
[Figure: Vision Sensors 1 and 2, Vision Server, Supervisor, Controller, and Actuator, with a Security Supervisor, a CA Supervisor, a CA Filter, and an Override]
Security overrides are used to manage Byzantine failures
- Security overrides must preserve low-level safety mechanisms
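The requirement that overrides preserve low-level safety can be sketched by ordering the two checks: an override may replace the controller's command, but the result still passes through the safety limit. The function names, the gap threshold, and the stop-when-too-close rule are assumptions made for illustration, not the mechanism used in the talk.

```python
from typing import Optional


def safety_limit(speed: float, gap_to_obstacle: float, min_gap: float = 1.0) -> float:
    """Low-level safety mechanism: stop when the gap to an obstacle is too small."""
    return 0.0 if gap_to_obstacle < min_gap else speed


def actuator_command(
    controller_speed: float,
    override_speed: Optional[float],
    gap_to_obstacle: float,
) -> float:
    """Apply the security override, if any, then always apply the safety limit."""
    requested = controller_speed if override_speed is None else override_speed
    return safety_limit(requested, gap_to_obstacle)


if __name__ == "__main__":
    print(actuator_command(2.0, None, gap_to_obstacle=5.0))  # normal operation
    print(actuator_command(2.0, 0.5, gap_to_obstacle=5.0))   # override takes effect
    print(actuator_command(2.0, 3.0, gap_to_obstacle=0.4))   # safety still wins
```

Placing the safety limit last means that even a mistaken or compromised override cannot command a move the low-level mechanism would have blocked.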
Safety preserving security overrides
Conclusions
Presence of software failures is a basic assumption of systems design
– Component based design isolates failures
– Etherware provides mechanisms to manage software failures
– Design principles to manage risk due to software failures:
  » Component based design to contain programming errors
  » Local temporal autonomy to tolerate passive failures
  » Collation to safeguard against active failures
  » Safety preserving security overrides to manage Byzantine failures