Evaluating Impact of Soft-Errors in an Embedded System

Post on 10-Feb-2016

Page 1: Evaluating Impact of Soft-Errors in an Embedded System

Evaluating Impact of Soft-Errors in an Embedded System

Vijay Sheshadri

Graduate Student, Dept. of Electrical Engineering

Page 2: Evaluating Impact of Soft-Errors in an Embedded System

April 22, 2023 2

What is a Soft-error?

Transient fault caused by cosmic ray particles.

[Figure: a stored bit flipping from 1 to 0]

A charged particle is incident on a component.

The charged particle creates electron-hole pairs (EHPs), which are collected by the drain.

Sufficient charge collection causes an erroneous bit-flip.

Page 3: Evaluating Impact of Soft-Errors in an Embedded System

Soft-error in a System

[Flowchart]
Bit read?
  no: benign fault, no error
  yes: Bit has error protection?
    yes, error can be corrected (e.g., ECC): no error
    yes, error is only detected (e.g., parity + no recovery): Detected, but Unrecoverable Error (DUE)
    no: Does bit matter?
      yes: Silent Data Corruption (SDC)
      no: benign fault, no error

Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
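
The decision flow on this slide can be sketched as a small function. This is only an illustrative reading of the flowchart: the function name, the `Outcome` enum, and the `"detect"` / `"correct"` labels are assumptions introduced here, not part of the original slides.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign fault, no error"
    CORRECTED = "no error (error corrected)"
    DUE = "detected, but unrecoverable error (DUE)"
    SDC = "silent data corruption (SDC)"


def classify_faulty_bit(is_read, protection, bit_matters):
    """Classify the outcome of a bit-flip.

    protection is None, "detect" (e.g., parity, no recovery) or
    "correct" (e.g., ECC). These labels are hypothetical.
    """
    if not is_read:
        return Outcome.BENIGN          # the flipped bit is never consumed
    if protection == "correct":
        return Outcome.CORRECTED       # ECC repairs the bit
    if protection == "detect":
        return Outcome.DUE             # error flagged, but no recovery
    # Unprotected bit: the outcome depends on whether the bit matters.
    return Outcome.SDC if bit_matters else Outcome.BENIGN


print(classify_faulty_bit(True, None, True).value)
```

An unprotected bit that is read and matters for the program output is the only path that ends in SDC.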

Page 4: Evaluating Impact of Soft-Errors in an Embedded System

Masking of Soft-error

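[Figure: combinational logic (inputs I1 to I7, gates B, C, D, E, outputs O1 and O2) between two register stages. A particle strike on the logic is shown together with three masking mechanisms: logical masking, electrical masking, and latching-window masking. Paths are labelled "Soft error" or "No soft error".]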

Page 5: Evaluating Impact of Soft-Errors in an Embedded System

FIT Equation: Vulnerability Factors

$\mathrm{FIT} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$

Vulnerability Factor = Timing Vulnerability Factor × Architectural Vulnerability Factor

• Timing Vulnerability Factor (TVF): fraction of time a bit is vulnerable

• Architectural Vulnerability Factor (AVF): fraction of time a bit matters for the final output of a program

Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
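
As a rough sketch of how the FIT equation on this slide is evaluated, the snippet below sums the intrinsic rate times TVF times AVF over a list of devices. All numbers and the function name `total_fit` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def total_fit(devices):
    """Sum intrinsic_rate * TVF * AVF over all vulnerable devices.

    devices: iterable of (intrinsic_fit, tvf, avf) tuples.
    """
    return sum(rate * tvf * avf for rate, tvf, avf in devices)


# Hypothetical values, for illustration only.
devices = [
    (100.0, 1.0, 0.0),    # structure whose bits never matter (AVF = 0)
    (100.0, 1.0, 1.0),    # structure whose bits always matter (AVF = 1)
    (200.0, 0.5, 0.25),   # partially vulnerable structure
]

print(total_fit(devices))  # 125.0
```

A structure with AVF = 0 contributes nothing to the total, regardless of its intrinsic error rate.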

Page 6: Evaluating Impact of Soft-Errors in an Embedded System

Architectural Vulnerability Factor (AVF): fraction of time a bit matters for the final output of a program

• Branch predictor: doesn’t matter at all (AVF = 0%)

• Program counter: almost always matters (AVF ~ 100%)

Computing AVF for complex structures:

• Statistical fault injection

• ACE (Architecturally Correct Execution) analysis

Source: Shubhu Mukherjee et al. Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005

Page 7: Evaluating Impact of Soft-Errors in an Embedded System

Soft-error & Automobiles

Mar 2010: NHTSA enlisted the NASA Engineering and Safety Center (NESC) to investigate “Unintended Acceleration”.

Apr 2011: NESC discounts SEU (single-event upset) in its report to NHTSA, stating that the ICs are manufactured using SOI (silicon-on-insulator) technology.

As per the AEC-Q100 standard, SEU testing is required for automobile electronics with RAM > 1 Mb.

Page 8: Evaluating Impact of Soft-Errors in an Embedded System

An Example

Predicted Block RAM upset rate for a Virtex-5 FPGA = 635 FIT/Mb ≈ 1.5E-05 upsets per day per Mb.

Ref: A. Lesea, “Continuing Experiments of Atmospheric Neutron Effects on Deep Submicron Integrated Circuits,” WP286 (v1.0), Xilinx, Inc. 2008

Assume this FPGA is used in a throttle control module. If 500,000 such vehicles are produced by the vendor, then, per Mb of Block RAM in each vehicle, total upsets per day = 1.524E-05 x 500,000 ≈ 7.6 upsets per day across the fleet.
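
The arithmetic behind this example can be checked with a few lines. The sketch uses the standard definition of 1 FIT as one failure per 10^9 device-hours; the assumption of 1 Mb of Block RAM per vehicle is implicit in the slide's calculation and is made explicit here. Function and variable names are illustrative.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_upsets_per_day(fit):
    """Convert a FIT rate into expected upsets per day."""
    return fit * 24 / HOURS_PER_FIT_UNIT


per_day_per_mb = fit_to_upsets_per_day(635)   # about 1.524e-05

vehicles = 500_000
mb_per_vehicle = 1.0                          # assumption: 1 Mb per vehicle
fleet_upsets_per_day = per_day_per_mb * mb_per_vehicle * vehicles

print(f"{per_day_per_mb:.3e} upsets/day/Mb")
print(f"{fleet_upsets_per_day:.2f} upsets/day across the fleet")
```

The fleet-level figure scales linearly with the amount of Block RAM per vehicle, so a design using more than 1 Mb would see proportionally more upsets.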


Page 9: Evaluating Impact of Soft-Errors in an Embedded System

Soft-error Mitigation

Robust (radiation-hardened) circuit designs are resilient to soft-errors

Soft-error mitigation at:

• Device-level: silicon-on-insulator, triple-well

• Circuit-level: DICE cell, triple-modular redundancy

• Architecture-level: RMT, lock-stepping, ECC

Page 10: Evaluating Impact of Soft-Errors in an Embedded System

Soft-error Mitigation

Soft-error mitigation techniques incur penalties in:

• area (spatial redundancy)

• timing (temporal redundancy)

Selective hardening of the components for reduced penalty

• Often based on logical/electrical/timing derating

A low-cost mitigation technique is proposed for critical applications, based on application derating

• Certain applications can mask or recover from transient faults*

Ref: V. Wong et al, “Soft Error Resilience of Probabilistic Inference Applications” SELSE II, 2006

Page 11: Evaluating Impact of Soft-Errors in an Embedded System

Critical Application - An Analogy

Climate monitor/display

Airbag deployment

GPS

Cruise control

• A micro-controller embedded in a car dashboard may be handling many applications.

• A critical application in this case could be ‘Airbag deployment’.

• A soft error (SE) during this application could be catastrophic.

Page 12: Evaluating Impact of Soft-Errors in an Embedded System

Target Module

PWM (pulse-width modulation): the output is a pulse whose width determines the speed of the motor.

Etpwmi0 module:

• ~800 FFs and ~3000 logic gates

• 180-nm CMOS technology, 80 MHz frequency

[Figure: block diagram with ADC, CPU core, PWM, and Motor]

Page 13: Evaluating Impact of Soft-Errors in an Embedded System

Basic Simulation Steps*

Pre-analysis: identify the components utilized by the critical application.

Fault injection: inject a single fault at a random time instance by depositing the opposite value on the component.

Error metrics:

• Error count: number of mismatches between the output and the reference

• PW count: number of clock cycles for which the output is ‘1’, compared to the reference

Ref: J. Blome et al, “Cost-Efficient Soft Error Protection for Embedded Microprocessors” CASES, 2006
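
The two error metrics above can be sketched on per-clock-cycle samples of the output. This is a minimal illustration, assuming the output and the reference are available as lists of 0/1 values sampled once per clock cycle; the function names and the waveforms are made up for the example.

```python
def error_count(output, reference):
    """Number of clock cycles where output and reference disagree."""
    return sum(o != r for o, r in zip(output, reference))


def pw_count(signal):
    """Number of clock cycles for which the signal is 1 (pulse width)."""
    return sum(signal)


# Hypothetical waveforms: the faulty output pulse is delayed by one cycle.
reference = [0, 1, 1, 1, 1, 0, 0, 0]
output    = [0, 0, 1, 1, 1, 1, 0, 0]

print(error_count(output, reference))            # 2
print(pw_count(output) - pw_count(reference))    # 0
```

In this example the delayed pulse produces two mismatching cycles even though the pulse width is unchanged, which is why the two metrics are tracked separately.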

Page 14: Evaluating Impact of Soft-Errors in an Embedded System

Simulation tools

Verilog netlist simulated with timing information, using Synopsys VCS.

Fault-injection module coded in C. It uses VPI (Verilog Procedural Interface) functions to:

• Access a net in the netlist (vpiHandle)

• Read the value of the net (vpi_get_value)

• Overwrite the value of the net (vpi_put_value)

Page 15: Evaluating Impact of Soft-Errors in an Embedded System

Simulation – Pre-analysis

Pre-analysis: categorize FFs based on their activity

a) Low-activity FFs (no. of toggles less than 2)

b) High-activity FFs (no. of toggles higher than 2)

Opposite values are forced and the output pulse is observed for errors.

FFs in which errors were observed are identified and subjected to fault injection.
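
The activity-based grouping can be sketched as follows, assuming each flip-flop's value is available as a per-cycle trace from a fault-free run. The slide does not say which group a flip-flop with exactly 2 toggles belongs to; placing it in the high-activity group here is an assumption, as are the flip-flop names and traces.

```python
def count_toggles(trace):
    """Number of value changes between consecutive samples."""
    return sum(a != b for a, b in zip(trace, trace[1:]))


def classify_ffs(traces, threshold=2):
    """Split flip-flops into (low_activity, high_activity) name lists.

    Assumption: fewer than `threshold` toggles means low activity;
    exactly `threshold` toggles is treated as high activity.
    """
    low, high = [], []
    for name, trace in traces.items():
        group = low if count_toggles(trace) < threshold else high
        group.append(name)
    return low, high


# Hypothetical traces from a fault-free simulation.
traces = {
    "ff_cfg": [0, 0, 0, 0, 1, 1],   # 1 toggle
    "ff_cnt": [0, 1, 0, 1, 0, 1],   # 5 toggles
}

low, high = classify_ffs(traces)
print(low, high)  # ['ff_cfg'] ['ff_cnt']
```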

Page 16: Evaluating Impact of Soft-Errors in an Embedded System

Simulation – Fault-injection

Fault injection:

• For the FFs obtained from pre-analysis, inject a fault at a random instant of time (within the time interval of the first output pulse)

• Measure Error count and PW count

• Identify FFs with error within acceptable limits

[Figure: the test bench (Verilog) passes the original value to the fault-injection module (C + VPI), which returns the modified value; the fault-injection window is marked against the output pulse]
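
The injection step can be modelled in a few lines. This is a simplified software model, not the C + VPI module used in the slides: it only picks a random injection time inside a given window and returns a copy of the flip-flop state with the target bit inverted. The function name, the state dictionary, and the window bounds are hypothetical.

```python
import random


def inject_single_fault(ff_values, target_ff, window, rng):
    """Pick a random injection time in [start, end) and flip one FF.

    Returns (inject_time, faulty_values); the input state is not modified.
    """
    start, end = window
    inject_time = rng.randrange(start, end)
    faulty = dict(ff_values)                      # copy the golden state
    faulty[target_ff] = 1 - faulty[target_ff]     # deposit the opposite value
    return inject_time, faulty


rng = random.Random(2023)                 # seeded for repeatable campaigns
golden = {"ff_a": 0, "ff_b": 1}           # hypothetical FF state
window = (100, 180)                       # hypothetical first-pulse interval
t, faulty = inject_single_fault(golden, "ff_b", window, rng)

print(t, faulty)
```

Seeding the random generator makes a fault-injection campaign reproducible, which helps when a particular injection needs to be re-examined.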

Page 17: Evaluating Impact of Soft-Errors in an Embedded System

Absolute error vs. Acceptable error

Absolute error: raise the error flag for any mismatch between the output pulse and the reference.

Acceptable error: raise the error flag only if the mismatch between the output pulse and the reference lies outside the tolerance limit*.

Examples: delayed pulse, self-correcting pulse.

[Figure: two waveform panels, each showing the target FF with the fault-injection point, the actual output, and the reference copy; one panel is annotated with the delay between the actual output and the reference]

Ref: X. Li, et al “Exploiting Soft Computing for Increased Fault Tolerance” Workshop on Architectural Support for Gigascale Integration, 2006
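
The difference between the two criteria can be sketched as below. The slides do not state how the tolerance limit is expressed; this example assumes it is a maximum number of mismatching clock cycles, and the function names and waveforms are illustrative only.

```python
def absolute_error(output, reference):
    """Flag an error on any mismatch at all."""
    return output != reference


def acceptable_error(output, reference, tolerance_cycles):
    """Flag an error only if mismatches exceed the tolerance.

    Assumption: tolerance is a maximum count of mismatching cycles.
    """
    mismatches = sum(o != r for o, r in zip(output, reference))
    return mismatches > tolerance_cycles


# Hypothetical waveforms: output pulse delayed by one clock cycle.
reference = [0, 1, 1, 1, 1, 0, 0, 0]
delayed   = [0, 0, 1, 1, 1, 1, 0, 0]

print(absolute_error(delayed, reference))          # True
print(acceptable_error(delayed, reference, 2))     # False
print(acceptable_error(delayed, reference, 1))     # True
```

Under the absolute criterion the delayed pulse counts as an error, while under the tolerance-based criterion it is flagged only when the delay exceeds the chosen limit.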

Page 18: Evaluating Impact of Soft-Errors in an Embedded System

Simulations-Combinational logic

Fault injection steps:

• SE modeled as a 1 ns pulse (system clock frequency = 80 MHz)

• Transient pulse injected onto the gate output

• Target combinational circuit selected at random

Example: 2-input NAND gate

[Figure: 2-input NAND gate (inputs A, B; output Y) with the injected fault on Y, comparing the actual output with the reference copy]
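
For a sense of scale, the pulse width can be compared with the clock period. The sketch below is a back-of-the-envelope estimate and not part of the original analysis: it assumes the pulse start time is uniformly distributed within a clock cycle and ignores setup/hold windows, electrical attenuation, and logical masking.

```python
CLOCK_HZ = 80e6          # system clock frequency from the slide
PULSE_WIDTH_S = 1e-9     # soft error modeled as a 1 ns pulse


def clock_period_s(freq_hz):
    """Clock period in seconds."""
    return 1.0 / freq_hz


def edge_overlap_probability(pulse_width_s, freq_hz):
    """Chance that a randomly timed pulse overlaps a clock edge.

    Assumes a uniformly random pulse start time within the cycle.
    """
    return min(1.0, pulse_width_s * freq_hz)


print(f"period = {clock_period_s(CLOCK_HZ) * 1e9:.1f} ns")
print(f"overlap probability = {edge_overlap_probability(PULSE_WIDTH_S, CLOCK_HZ):.2f}")
```

Under these assumptions the pulse covers only a small part of each 12.5 ns cycle, so most transients would miss the latching edge even before other masking effects are considered.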

Page 19: Evaluating Impact of Soft-Errors in an Embedded System

Results

Pre-analysis: ~18% of FFs are used by the application.

Fault injection: the number of faults injected is proportional to the number of flip-flops in the group.

Low-toggle FFs are more numerous, hence the number of faults injected in low-toggle FFs is higher.

Page 20: Evaluating Impact of Soft-Errors in an Embedded System

Results

Low-toggle FFs are more vulnerable to soft-errors, since an erroneous bit-flip may persist without being overwritten.

High-toggle FFs are written very often, so an erroneous bit-flip has a higher probability of being overwritten.

Page 21: Evaluating Impact of Soft-Errors in an Embedded System

Computing AVF

AVF = Pe × (% component)

Pe = probability that a fault injected in the component results in an error = (no. of errors) / (no. of faults injected)

% component = the percentage of that component with respect to the total number of components

Example: for a latch,

a. if # errors = 50% of injected faults (Pe = 0.5)

b. if latches make up 20% of the circuit

then AVF = 0.5 × 0.2 = 0.1
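
The AVF calculation on this slide can be written out as a short sketch. The function names are illustrative; the numbers reproduce the latch example (50 errors out of 100 injected faults, latches making up 20% of the circuit).

```python
def error_probability(num_errors, num_faults_injected):
    """Pe: fraction of injected faults that resulted in an error."""
    if num_faults_injected <= 0:
        raise ValueError("at least one fault must be injected")
    return num_errors / num_faults_injected


def component_avf(num_errors, num_faults_injected, component_fraction):
    """AVF = Pe * (fraction of the circuit made up by this component)."""
    return error_probability(num_errors, num_faults_injected) * component_fraction


avf_latch = component_avf(num_errors=50, num_faults_injected=100,
                          component_fraction=0.20)
print(round(avf_latch, 3))  # 0.1
```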

Page 22: Evaluating Impact of Soft-Errors in an Embedded System

AVF - Results

Low-activity FFs have a higher Pe and are more numerous; hence they have a higher AVF.

Combinational logic, though high in number, has Pe ~4E-03, causing its AVF to drop.

Page 23: Evaluating Impact of Soft-Errors in an Embedded System

Summary

• Fault-resilience scheme for critical applications using application derating and inherent error tolerance

• For the application considered, ~12% of the sequential logic was safety-critical (previous work reports 30% of sequential logic hardened for 99% fault coverage in an ARM embedded processor running an image-processing algorithm)

• Failures in combinational logic were negligible

• Worst-case scenario would only be the same as radiation-hardening a generic system, i.e., all the hardware is identified as safety-critical

Page 24: Evaluating Impact of Soft-Errors in an Embedded System

Future Work

Perform fault-injection analysis on the processor core managing the control loop

Conduct neutron beam experiments on the circuit to compare with simulations and find FIT rate

Implement circuit hardening and test the system to ascertain its robustness
