Experimental Evaluation of System-Level Supervisory Approach for SEFIs Mitigation
Mrs. Shazia Maqbool and Dr. Craig I Underwood
Maqbool 1 MAPLD 2005/P181
Overview
• Context
• Motivation
• Mitigation Scheme
  Top Level Description
  Protocol Gets Defined
  Test Bed
  Experimental Results
• Conclusions
Single Event Functional Interrupts (SEFIs)
• A type of anomaly in microcircuits caused by a single ion strike
• Occurs in the sensitive cross-section of the device
• User does not have direct access to the fault location
• Signatures
  An upset rate higher than expected
  Non-responding device
  In a communication network, a SEFI is an event that stops communication
  Variations in device current consumption
• During a SEFI, the device is unavailable to the system
• Device is potentially recoverable
  Recovery involves resetting or power cycling
  System recovery requires restoring the device functionality, followed by its state recovery
Mitigation Levels
• Device Level
  Using Radiation Hardening Processes
  Built-in Fault Tolerance Features
  Incorporating Redundancy within Device
• Unit Level
  Error Detection and Correction (EDAC)
  Redundancy Techniques, e.g. Voting, Lockstep etc.
  Configuration Scrubbing
• System/Architectural Level
  Data Handling Networks
Motivation
• Space applications demand:
more…
Computational power
Standardization
Reusability
but less…
Mass, volume and power budget
Cost
Development time
• Candidate architectures are heavily based on state-of-the-art COTS technology
Reliability
Availability
• SEFIs and single event transients are becoming dominant radiation hazards
• A unit level approach has usually been considered for SEFI mitigation
System Architecture
[Figure: system architecture block diagram. Units shown: Payload Computer, Supervisor, OBC, Mass Memory Unit, SpaceWire Router, Code Store, Power Subsystem Controller]
• A fast data network interlinks all units
Scalable
Distributed
Reusable
• A system-level SEFI mitigation
• A diagnosis and recovery (DAR) packet from each unit acts as an indicator of the health status of that unit
• The supervisor intervenes when a packet does not arrive or does not match expectation
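The supervisor's decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `DarPacket` fields, the `needs_intervention` function and the threshold parameter are hypothetical names, chosen to mirror the test result and SEU count that the later slides say a DAR packet carries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DarPacket:
    """Hypothetical view of a DAR packet's health data."""
    unit_id: int
    test_result: int  # result of the unit's self-test task
    seu_count: int    # memory upsets counted since the last packet


def needs_intervention(packet: Optional[DarPacket],
                       expected_result: int,
                       seu_threshold: int) -> bool:
    """Return True when the supervisor should step in.

    `packet` is None when no DAR packet arrived before the time-out.
    """
    if packet is None:
        return True  # packet did not arrive
    if packet.test_result != expected_result:
        return True  # packet does not match expectation
    return packet.seu_count > seu_threshold
```

A healthy unit yields `False`; a missing packet, a wrong test result or an excessive upset count each yield `True`.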
Why a System-Level Approach
• Cost-effective
• Adaptable
• Reusable
• The power cycling needed to recover from SEFIs demands an external entity to hold state data and to initiate the recovery procedure
• In case of a permanent failure, the affected unit can be switched off
• Supervisory functions, network and configuration management can be combined
On-Board Computer
• Possible sources of fault
  Processor
  Memory
  Network interface
[Figure: the OBC subsystem, with blocks labelled Processor, EDAC, Memory, Interface FPGA and OPC]
[Figure: Over-Current Protection Circuitry (OPC), with blocks labelled Amplifier, Comparator, Vset, OBC Module, ADC, Interface FPGA, DAC and I]
• Required underlying mitigations
  EDAC
  OPC
SEFI Signatures
Device Type          Signatures
Microprocessor       Screech; non-responsive; calculation errors; current variations
Memory               Upset rate higher than expected
Network interface    Invalid packet; non-responsive; link establishment time-out; transmit and receive time-out
Supervisory Protocol
• Two types of packets
  Screech packet
  Diagnosis And Recovery (DAR) packet
• DAR task
  Performs testing of the processor
  Collects the error count of the memory unit
  Updates state data
  Current consumption of the OBC module will be monitored
[Figure: packet format with the fields System ID, Length, Flags, Diagnostic health data/Screech data]
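The packet layout can be illustrated with Python's `struct` module. The slides give only the field order (System ID, Length, Flags, data), so the field widths, the network byte order and the `FLAG_SCREECH` bit below are assumptions made for this sketch.

```python
import struct

# Assumed layout: 1-byte system ID, 2-byte payload length, 1-byte flags,
# followed by the payload. Only the field order comes from the slides.
HEADER = struct.Struct("!BHB")
FLAG_SCREECH = 0x01  # hypothetical bit: payload is screech data, not health data


def pack_packet(system_id: int, flags: int, payload: bytes) -> bytes:
    """Build a packet; the length field is derived from the payload."""
    return HEADER.pack(system_id, len(payload), flags) + payload


def unpack_packet(raw: bytes):
    """Split a packet into (system_id, flags, payload), validating the length."""
    if len(raw) < HEADER.size:
        raise ValueError("packet shorter than header")
    system_id, length, flags = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if len(payload) != length:
        raise ValueError("length field does not match payload size")
    return system_id, flags, payload
```

Checking the length field on receipt gives the supervisor one cheap way to classify a packet as invalid, which the slides list as a network-interface SEFI signature.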
Diagnosis And Recovery (DAR) Packet Flow
[Figure: sequence diagram with four lanes: OBC-Processor, OBC-Interface FPGA, Supervisor and Code Store. Step labels: DAR Process Starts; Disable Interrupts; SODARP Marker; Start Sampling Current; Perform Test; Collect Current Value; Enable Interrupts; Collect SEU Count; Send DAR Packet; Waiting for Supervisor Response; DAR Packet Received; Compare with Stored Values; Command to Update Program Memory; Update Memory]
Recovery Method
Fault Type                               Recovery Procedure
Screech                                  Reload program memory
Packet time-out (network problem)        See Recovery Method (2)
Packet time-out (processor problem)      See Recovery Method (2)
Current consumption variations           Power cycle and reload memory
SEU count exceeding threshold            Reload memory
Test task result mismatch                Reset and reload memory
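The fault-to-recovery mapping lends itself to a simple lookup. The sketch below encodes the four rows of the table that have a direct recovery procedure; the `Fault` enum and the step names are hypothetical labels, and the two packet time-out cases are left out because they are handled by the separate flow on the next slide.

```python
from enum import Enum


class Fault(Enum):
    SCREECH = "screech"
    CURRENT_VARIATION = "current consumption variations"
    SEU_COUNT_EXCEEDED = "SEU count exceeding threshold"
    TEST_MISMATCH = "test task result mismatch"


# Ordered recovery steps per fault type, taken from the table.
RECOVERY = {
    Fault.SCREECH: ("reload_memory",),
    Fault.CURRENT_VARIATION: ("power_cycle", "reload_memory"),
    Fault.SEU_COUNT_EXCEEDED: ("reload_memory",),
    Fault.TEST_MISMATCH: ("reset", "reload_memory"),
}


def recovery_steps(fault: Fault) -> tuple:
    """Return the recovery steps for a diagnosed fault, in order."""
    return RECOVERY[fault]
```

Keeping the mapping in data rather than in branching code makes it easy to adapt the supervisor to another unit with different fault signatures.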
Recovery Method (2)
[Figure: flow chart for packet time-outs. It starts with a polling request to the OBC interface, followed by decisions on link connection time-out, receive packet time-out and transmit packet time-out, with outcome boxes "OBC interface blocked" and "Supervisor interface blocked" (twice). Further boxes: Interrupt request, INTR request time-out, NMI request time-out and Reset request, with three "I am okay" packet received decisions and Yes/No branches]
• In case of a processor reset and power cycle, the OBC should be allowed sufficient time for reinitialization
• The supervisor needs to keep a record of recoveries applied
• Consecutive recovery cycles need to be avoided
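One way to read the flow chart and the notes above is as an escalation ladder with a recovery log. The sketch below assumes the order interrupt, then NMI, then reset, inferred from the box labels, and treats "avoiding consecutive recovery cycles" as enforcing a minimum interval between attempts. `RecoveryLog`, `escalate` and the `send_request` callback are hypothetical names, not part of the original design.

```python
from typing import Callable, List, Optional, Tuple

ESCALATION = ("interrupt", "nmi", "reset")  # assumed order, from the box labels


class RecoveryLog:
    """Record of recoveries applied, with a minimum spacing between them."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self.entries: List[Tuple[float, str]] = []

    def allowed(self, now: float) -> bool:
        # Refuse a new cycle while the OBC may still be reinitializing.
        return not self.entries or now - self.entries[-1][0] >= self.min_interval_s

    def record(self, now: float, action: str) -> None:
        self.entries.append((now, action))


def escalate(send_request: Callable[[str], bool],
             log: RecoveryLog, now: float) -> Optional[str]:
    """Try each request in turn until the OBC answers "I am okay".

    `send_request(kind)` issues the request and returns True if the reply
    arrived before its time-out. Returns the request that worked, or None
    if the cycle was skipped or nothing worked.
    """
    if not log.allowed(now):
        return None
    for kind in ESCALATION:
        if send_request(kind):
            log.record(now, kind)
            return kind
    log.record(now, "unrecovered")
    return None
```

The minimum interval would be chosen to cover the reinitialization time mentioned in the first note.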
Test Bed
• Demonstration of the synchronization protocol
  PC1 executes the OBC program
  PC2 executes the supervisor program
[Figure: test bed diagram. Labels: PC 1, PC 2, Hub, Celoxica RC203, Parallel Port, Ethernet]
Synchronization Scheme 1
• Supervisor polls the OBC every 1 s
• OBC is required to send back a packet
• Supervisor holds a time-out check
• If a time-out occurs, the OBC program is manually reset
• Time-out = maximum time required by both the supervisor and OBC packets to travel in the network + margin
Synchronization Scheme 2
• OBC sends a packet to the supervisor every 1 s
• Supervisor holds a time-out check
• Time-out = invocation period of the OBC packet + max. time to travel + margin
• If a time-out occurs, the OBC program is manually reset
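The two time-out rules translate directly into arithmetic. The functions below are a sketch of those formulas only; the function and parameter names are illustrative, and any values passed in would have to come from measurements on the actual network.

```python
def scheme1_timeout(max_round_trip_s: float, margin_s: float) -> float:
    """Scheme 1: the supervisor polls and waits for the reply.

    max_round_trip_s is the maximum time for the supervisor packet and
    the OBC packet together to travel through the network.
    """
    return max_round_trip_s + margin_s


def scheme2_timeout(period_s: float, max_travel_s: float, margin_s: float) -> float:
    """Scheme 2: the OBC sends unprompted every period_s seconds."""
    return period_s + max_travel_s + margin_s


def timed_out(last_packet_s: float, now_s: float, timeout_s: float) -> bool:
    """True when no packet has been seen within the time-out window."""
    return now_s - last_packet_s > timeout_s
```

In scheme 2 the invocation period dominates the time-out, so a fault is detected later than in scheme 1 unless the period is short.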
Synchronization Scheme 1
Configuration 1
• UDP/IP packet from the supervisor arrives on Ethernet
• The RC203 board passes it as it is to the OBC program over the parallel port
• The OBC program receives the packet and checks its source; if it is from the supervisor program, it sends a packet to the FPGA over the parallel port
• The FPGA sends the packet to the supervisor program on Ethernet
Configuration 2
• UDP/IP packet from the supervisor arrives on Ethernet
• The RC203 passes only the data bytes to the OBC program over the parallel port
• The OBC program receives the data and sends data bytes to the FPGA over the parallel port
• The FPGA encodes the received data into a UDP/IP packet and sends it on Ethernet
Time Measurement Method
• The Ethereal graphical user interface (GUI) network protocol analyzer was used
  It displays the time at which each packet was captured
  It also displays the IP source and destination, the protocol type, and the source and destination ports for all captured packets
  Selecting a packet from the list of captured packets shows the total bytes captured on the network medium, the Ethernet source and destination addresses, and the number of data bytes in the packet
• For synchronization scheme 1, time was measured from the moment Ethereal captured the packet sent by the supervisor to the moment it captured the return packet from the OBC
• For synchronization scheme 2, time was measured between two consecutive packets from the OBC
Results (1)
• Average time: average time between the two packets in a pair (supervisor packet and the OBC packet sent in response)
• Time per byte: time required for 1 byte to travel through the system, computed as (time measured in run n+1 with N+K bytes - time measured in run n with N bytes) divided by K
Data bytes   Average time (ms)   Time per byte (µs)
18           101.053             (101.264 - 101.053)/18 = 11.7
36           101.264             (101.822 - 101.264)/36 = 15.5
72           101.822             (102.229 - 101.822)/28 = 14.53
100          102.229             (103.677 - 102.229)/400 = 14.85
500          103.677
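The per-byte figures can be recomputed from the tabulated averages. The sketch below applies the stated formula to consecutive rows, reading the result in microseconds; `MEASUREMENTS` and `per_byte_times_us` are illustrative names. It reproduces the first three entries to within rounding. For the step from 100 to 500 bytes the same formula gives about 3.6 µs rather than the 14.85 listed, so that entry may come from a different measurement or a transcription slip.

```python
# (data bytes, average time between supervisor packet and OBC response, in ms)
MEASUREMENTS = [
    (18, 101.053),
    (36, 101.264),
    (72, 101.822),
    (100, 102.229),
    (500, 103.677),
]


def per_byte_times_us(measurements):
    """Incremental time per extra data byte between consecutive runs, in µs."""
    results = []
    for (n0, t0), (n1, t1) in zip(measurements, measurements[1:]):
        extra_bytes = n1 - n0               # K in the slide's formula
        results.append((t1 - t0) * 1000.0 / extra_bytes)  # ms -> µs
    return results


if __name__ == "__main__":
    for value in per_byte_times_us(MEASUREMENTS):
        print(f"{value:.2f} µs per byte")
```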
Results (2)
Data bytes   Experiment                                          Time between a supervisor packet and the OBC response packet (µs)
18           Configuration 1 with RC200GetBlockStall function    101053
18           Configuration 1 with RC200GetBlock function         1201
18           Configuration 2 with RC200GetBlockStall function    101066
18           Ping program                                        270
Synchronization Scheme 2
Configuration for Synchronization Scheme 2
• OBC sends data bytes to the interface FPGA over the parallel port
• Interface FPGA encodes the data into a UDP/IP packet and writes it on Ethernet
Experiment: synchronization scheme 2, time measured between two consecutive packets from the OBC (18 data bytes)
Time measured: 261 µs
Time between the last OBC packet prior to the fault and the first packet after recovery
Scheme                      Fault and recovery                                                        Time
Synchronization scheme 1    OBC program crashed and reinitialized manually                            12 s, 299 ms and 644 µs
Synchronization scheme 1    FPGA cleared using FTU facility and OBC program reinitialized manually    15 s, 339 ms and 501 µs
Synchronization scheme 2    OBC program crashed and reinitialized manually                            11 s, 316 ms and 272 µs
Synchronization scheme 2    FPGA cleared using FTU facility and OBC program reinitialized manually    14 s, 971 ms and 559 µs
Conclusions
• A system-level approach has been presented to mitigate SEFIs in data handling architectures
• Upset detection is not straightforward, which limits the effectiveness of currently available mitigation techniques
• Increasing SEFI susceptibility in all major data handling device technologies
• A system-level intelligent supervisor allows monitoring of a wide range of devices with minimal overhead
• Synchronization is straightforward
• Two synchronization schemes have been demonstrated
• A few simple experiments were performed to establish a time-out period for a packet from the OBC
• Once this information was obtained, the system behaved as expected and synchronized packet communication was established between the OBC and the supervisor programs
• In the event of a SEFI, the supervisor program needs to wait until the OBC program is up again. The time-out for this wait period will depend on the recovery latency
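As an illustration of how the wait period could follow from the recovery latency, the sketch below converts the four measured recovery times to seconds, reading the last field of each as microseconds, and adds a margin to the largest. `OBSERVED`, `to_seconds` and `wait_timeout_s` are hypothetical names, and the margin is a free parameter. The measured values include manual reinitialization of the OBC program, so an automated recovery would likely need a different figure.

```python
# Measured gaps between the last packet before the fault and the first
# packet after recovery, as (seconds, milliseconds, microseconds).
OBSERVED = [
    (12, 299, 644),  # scheme 1, OBC program crashed
    (15, 339, 501),  # scheme 1, FPGA cleared
    (11, 316, 272),  # scheme 2, OBC program crashed
    (14, 971, 559),  # scheme 2, FPGA cleared
]


def to_seconds(s: int, ms: int, us: int) -> float:
    """Combine a (s, ms, µs) triple into seconds."""
    return s + ms / 1_000 + us / 1_000_000


def wait_timeout_s(latencies, margin_s: float) -> float:
    """Wait period: longest observed recovery latency plus a margin."""
    return max(to_seconds(*latency) for latency in latencies) + margin_s
```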