
Self-Configuring TMR Scheme utilizing Discrepancy Resolution (SCDR)

Naveed Imran and Ronald F. DeMara
Department of Electrical Engineering and Computer Science, University of Central Florida

International Conference on ReConFigurable Computing and FPGAs (ReConFig 2011)

Nov-Dec 2011, Cancun, Mexico

Agenda

1. Related Work
2. SCDR: A Self-Configuring TMR
   RO-1: How to make TMR adaptive? --> achieve self-healing
   RO-2: How to perform Evolvable Hardware (EHW) rapidly? --> less reconfiguration overhead
3. Operational Flow of Self-Healing Process
4. Experiment Design and FIR Results
5. Recovery Comparison
6. Conclusions

Need for Autonomous Sustainable Systems

How can a system sustain itself within mission lifetime specifications when operating in failure-prone deployments?

• Harsh Environments: hardware failures are inevitable
  − Aging (TDDB, EM), local permanent damage, etc.
  − Yet scrubbing addresses only soft errors
• Mission Critical: autonomy, safety, financial impact
  − Inaccessible: space, deep-sea, and unmanned avionic missions
• Fault Tolerance Mandatory
  − Fault Avoidance. “Always possible?” … No
  − Design Margin. “Always adequate?” … No
  − Modular Redundancy. “Always recoverable?” … No
  − Autonomous Fault Refurbishment. “Highly flexible?” … Yes!
• How to achieve?
  − Sustain performance via failed components in alternate modes --> Reconfiguration
  − Develop a design-independent repair technique --> Evolvable Hardware?
  − Refurbish within time constraints --> availability ∝ 1/MTTR
  − Use on-line inputs instead of test vectors

Unforeseen events, limited human intervention, and weight/size/power constraints.

Our Goal: Autonomous FPGA Refurbishment

Redundancy vs. Refurbishment:

• Overhead from Unutilized Spares (weight, size, power)
  − Redundancy: increases with amount of spare capacity
  − Refurbishment: weakly related to recovery capacity
• Granularity of Fault Coverage (resolution where fault handled)
  − Redundancy: restricted at design-time
  − Refurbishment: variable at recovery-time
• Fault-Resolution Latency (availability via downtime required to handle fault)
  − Redundancy: based on time required to select spare resource
  − Refurbishment: based on time required to find suitable recovery
• Quality of Repair (likelihood and completeness)
  − Redundancy: determined by adequacy of spares available
  − Refurbishment: affected by multiple characteristics (+ or -)
• Autonomous Operation (fix without outside intervention)
  − Redundancy: yes
  − Refurbishment: yes

Goal: increase availability without carrying pre-configured spares.

Examples: TMR (XTMR, BLTMR) for redundancy; EHW for refurbishment.

Related Work

Resource testing techniques
• Online Built-In Self Testing (BIST) method for fault isolation of FPGA resources [Abramovici2001], [Gericota2008], [Dutt2008]

Redundancy-based techniques
• Concurrent Error Detection (CED) arrangement [DeMara2005]
• Triple Modular Redundant (TMR) arrangement --> fault masking in the system output via a majority voter [Carmichael2006], [Morgan2007]
• TMR with Single-Module Recovery [Garvie2004], [Al-Haddad2011]

Evolutionary techniques
• Genetic Algorithms (GAs) have been employed to generate circuits at design-time which are robust to faults [Keymeulen2000]
• Online evolutionary regeneration [Zhang2005]

Permanent Fault Handling
• Permanent faults, especially due to aging effects in sub-90nm technology, are noted in the literature [Bolchini2010]
• Further, the reduced feature size of future SRAM-based FPGAs may increase their vulnerability to runtime as well as permanent faults, affecting the performance of future nano-scale systems [Rao2011]

SCDR: A Self-Configuring TMR

• RO-1: Sustaining performance after multiple failures to improve mission lifetime

• Approach: Self-configuring evolvable hardware using SRAM-based FPGAs

• RO-2: Autonomous Fault Refurbishment: employ static or dynamic reconfiguration capability to provide runtime adaptability

• Improved fault capacity beyond conventional TMR --> Sustainability
  − Handle multiple simultaneous faults in multiple triplicate modules

• Fitness evaluation uses the actual inputs of the system, avoiding any test vectors --> Availability

• Advantage: an explicit fault isolation phase is unnecessary. The faulty modules are autonomously avoided by the evolutionary recovery process --> Reliability

• Applicability: fully supports pipelined circuits

TMR Realization of Pipelined Circuit


• To improve reliability in the presence of multiple faults, which may occur during long mission durations, the datapaths can be adapted.

• The SCDR concept is to avoid the faulty blocks and utilize the healthy blocks in the processing datapath.

• The fault recovery process is initiated if two or more replicated datapaths are fault-affected.

• Upon fault detection in two instances of a TMR arrangement, the system is recovered through reconfiguration by rearranging stages in two instances.

• The configuration of the Module Switch (MS) elements determines which module is selected as the active element of each instance.
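The trigger condition above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' voter design: the function name `majority_vote` and the use of plain integers as output samples are assumptions. It also assumes that two faulty instances do not happen to produce the same wrong value.

```python
from collections import Counter

def majority_vote(y1, y2, y3):
    """Return (voted_output, needs_recovery) for one TMR output sample.

    needs_recovery is True when no two instances agree, which means at
    least two of the three datapaths are fault-affected.
    """
    value, count = Counter([y1, y2, y3]).most_common(1)[0]
    if count >= 2:
        return value, False   # a single fault (or none) is masked by the voter
    return None, True         # no majority: recovery has to be initiated

print(majority_vote(7, 7, 7))   # (7, False)
print(majority_vote(7, 9, 7))   # (7, False)
print(majority_vote(7, 9, 4))   # (None, True)
```

A single disagreeing instance is masked as in conventional TMR; only the loss of any majority starts the reconfiguration described above.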

Adaptation of the TMR Pathways by Altering Datapath at Runtime

• A Router Box (RB) is an abstraction of the three MS elements, one for each instance of a TMR pathway. It supports six possible input-output pairings (i.e., 3! = 6).

• An RB can be implemented in hardware either by routing wires through partial reconfiguration or by using multiplexers.

• Using Multiplexers:
  (+) Through select-line inputs, various configurations of the RB can be explored in an efficient manner
  (-) They remain in the datapath at all times, even after fault recovery

• Using Partial Reconfiguration (PR):
  (+) Reduced latency overhead in datapath
  (-) Reconfiguration time can be considerable
  (-) Bus Macros area overhead

• RO-2: faster EHW --> Higher Availability
  − Exploring search space for rapid recovery with MUX
  − Latency reduced after solution found using PR
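As a rough software model of the multiplexer-based RB, the six pairings can be enumerated as the permutations of three module indices. The names `RB_SETTINGS` and `route` are hypothetical, and the ordering of the six settings is simply whatever `itertools.permutations` yields, not an encoding taken from the slides.

```python
from itertools import permutations

# The 3! = 6 input-output pairings of one Router Box. The tuple index is
# the TMR instance; the value is the physical module connected to it.
RB_SETTINGS = list(permutations(range(3)))

def route(select, module_outputs):
    """Model a multiplexer-based RB: 'select' (0..5) picks one pairing."""
    pairing = RB_SETTINGS[select]
    return [module_outputs[m] for m in pairing]

print(len(RB_SETTINGS))              # 6
print(route(0, ["a", "b", "c"]))     # ['a', 'b', 'c']
print(route(5, ["a", "b", "c"]))     # ['c', 'b', 'a']
```

Changing `select` corresponds to changing the multiplexer select lines, which is why exploring settings this way is cheap compared with a partial reconfiguration.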


Router Box

Encoding Representation of the TMR Pathways

• Genetic Algorithms:
  − Implement guided trial-and-error search using principles of biological evolution
  − Iterative selection enforces “survival of the fittest”

• An evolutionary fault recovery process explores the search space of RB settings to obtain a more suitable TMR realization.

• Number of variables in a GA individual = number of pipeline stages (or number of blocks), n, in an instance of the TMR arrangement.

• Chromosome length = n. Each encoding field selects 1 of the 6 routing permutations.

Figure captions: Example of one possible configuration of the TMR system; an individual of the population.
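The encoding can be illustrated with a short decoder that turns a chromosome of n genes into the module used by each instance at each stage. The helper names `random_individual` and `decode`, and the numbering of the six permutations, are assumptions made for this sketch.

```python
import random
from itertools import permutations

RB_SETTINGS = list(permutations(range(3)))  # 6 routing permutations per stage

def random_individual(n, rng):
    """One GA individual: n genes, each selecting 1 of 6 RB settings."""
    return [rng.randrange(len(RB_SETTINGS)) for _ in range(n)]

def decode(chromosome):
    """Return, per TMR instance, the physical module used at each stage."""
    paths = [[], [], []]
    for gene in chromosome:
        for instance, module in enumerate(RB_SETTINGS[gene]):
            paths[instance].append(module)
    return paths

print(decode([0, 3, 5]))   # [[0, 1, 2], [1, 2, 1], [2, 0, 0]]
print(random_individual(25, random.Random(1)))
```

In the decoded example, instance 0 uses module 0 at the first stage, module 1 at the second, and module 2 at the third.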

Operational Flow of Evolutionary Recovery Process

1. Objective for EHW recovery is specified
   − Realize an operational CED arrangement out of the TMR arrangement
   − The Fitness Function is defined in terms of the output discrepancy of the CED pair

2. Population of alternative designs is created
   − Initially at random; each individual of the population corresponds to one complete configuration of the TMR

3. Genetic Algorithm invoked to evolve each alternative
   − Fitness evaluated for alternatives using reconfiguration of RB connections
   − Genetic Operators used to increase fitness

4. Fitness Exit Criteria checked
   − If Fitness Score != 0 then repeat Step 3

5. Best design represents desired TMR configuration

The recovery process in the context of a standard GA [Goldberg1989]
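The five steps can be sketched as a compact loop. Everything concrete here is an assumption for illustration: the toy pipeline in `run_instance`, the hypothetical fault set `FAULTY`, the parameter defaults, and the use of simple truncation selection (keep the better half) in place of whatever selection scheme the experiments used.

```python
import math
import random
from itertools import permutations

RB = list(permutations(range(3)))    # the 6 Router Box settings per stage
FAULTY = {(0, 0), (2, 1), (3, 2)}    # hypothetical faulty (stage, module) pairs

def run_instance(chrom, instance, x):
    """Toy pipeline: a healthy stage adds 1; a faulty module additionally
    adds a fault-specific error (a distinct power of two)."""
    for stage, gene in enumerate(chrom):
        module = RB[gene][instance]
        x += 1
        if (stage, module) in FAULTY:
            x += 2 ** (3 * stage + module)
    return x

def fitness(chrom, window=range(8)):
    """Step 1: discrepancy between instances 0 and 1 (the CED pair)."""
    return math.sqrt(sum((run_instance(chrom, 0, x) - run_instance(chrom, 1, x)) ** 2
                         for x in window))

def evolve(n, fitness, rng, pop_size=20, max_gens=200, p_mut=0.1):
    # Step 2: random initial population of complete TMR configurations.
    pop = [[rng.randrange(6) for _ in range(n)] for _ in range(pop_size)]
    for gen in range(max_gens):
        pop.sort(key=fitness)                    # step 3: evaluate and rank
        if fitness(pop[0]) == 0:                 # step 4: exit criterion
            return pop[0], gen
        parents = pop[:pop_size // 2]            # keep the low-discrepancy half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.randrange(6) if rng.random() < p_mut else g
                     for g in child]             # per-gene mutation
            children.append(child)
        pop = parents + children
    pop.sort(key=fitness)
    return pop[0], max_gens                      # step 5: best design found

best, gens = evolve(4, fitness, random.Random(0))
print(best, gens, fitness(best))
```

In this toy model a zero discrepancy implies that neither instance of the CED pair uses a faulty module, because each fault contributes a unique error term. Real circuits offer no such guarantee, so this is only a stand-in for the hardware evaluation.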

Fitness Function

• Given many possible configurations of the system, the objective is to assess which is superior in terms of the desired functionality of the system.

• Storing input-output test vectors or truth tables of the circuit to assess its fitness is avoided, as it is not tractable for large-scale circuits.

• A Discrepancy Value (DV) is defined as the Euclidean distance between the outputs of two instances of a TMR in a given evaluation window (in units of inputs applied). The objective is to minimize the DV between Y1 and Y2.

Figure labels: Input; Instance1, Instance2, Instance3; outputs Y1, Y2.

Assessment of the fitness based upon relative behavior instead of absolute fitness knowledge
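A minimal sketch of the DV computation, assuming the two output streams are captured as equal-length lists of numbers over one evaluation window (the function name `discrepancy_value` is hypothetical):

```python
import math

def discrepancy_value(y1, y2):
    """Euclidean distance between two output sequences over one window."""
    if len(y1) != len(y2):
        raise ValueError("both instances must cover the same window")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))

print(discrepancy_value([1, 2, 3, 4], [1, 2, 3, 4]))   # 0.0, instances agree
print(discrepancy_value([1, 2, 3, 4], [1, 5, 3, 0]))   # 5.0, differences of 3 and 4
```

No golden reference appears anywhere in the computation; only the relative behavior of the two instances is used.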

Fitness Evaluation & Selection

• The input is applied to the TMR arrangement and the output of the individual instances is observed during the evaluation window period.

• Upon the completion of the evaluation window, a new configuration of the TMR is introduced into operation.

• Once all the individuals have been evaluated, a generation of the GA is complete.

• Fitness scores of the individuals, which are based upon their agreement/disagreement history, become available at the end of a generation.

• Which configurations should be retained for subsequent operations? Rank selection [Baker1985] --> configurations having small DV are retained to produce new configurations in the next generation.

• Genetic Operators --> crossover and mutation
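Rank selection can be sketched with the commonly used linear ranking form, in which the best individual receives `max_expected` expected offspring and the worst receives `2 - max_expected`. Whether the experiments used exactly this form, and the value 1.5, are assumptions; the function names are hypothetical.

```python
import random

def rank_probabilities(dvs, max_expected=1.5):
    """Linear ranking: selection probability depends on rank, not raw DV.

    Requires at least two individuals and 1 <= max_expected <= 2.
    """
    n = len(dvs)
    min_expected = 2.0 - max_expected
    order = sorted(range(n), key=lambda i: dvs[i], reverse=True)  # worst first
    probs = [0.0] * n
    for rank, i in enumerate(order):          # rank 0 = largest DV (worst)
        expected = min_expected + (max_expected - min_expected) * rank / (n - 1)
        probs[i] = expected / n
    return probs

def select_parents(population, dvs, k, rng):
    """Draw k parents with replacement according to their rank."""
    return rng.choices(population, weights=rank_probabilities(dvs), k=k)

probs = rank_probabilities([4.0, 0.0, 9.5])
print(probs)   # the individual with DV 0.0 gets the largest probability
print(select_parents(["cfg_a", "cfg_b", "cfg_c"], [4.0, 0.0, 9.5], 4, random.Random(0)))
```

Because only the ordering of the DVs matters, one configuration with a very large discrepancy does not distort the selection pressure on the rest of the population.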


Experiment Design

• A 25-stage FIR filter is implemented in Verilog HDL using the Xilinx ISE 9.2i development tool.

• An ML410 development board --> Virtex-4 FPGA chip, Compact Flash interface, DDR RAM, and UART.

• The circuit is triplicated to realize a TMR system.

• The logic resources utilized by one instance of the TMR arrangement: 2810 LUTs and 500 FFs.

• Hard multipliers are avoided during synthesis to simulate a generic design.

• Random Stuck-At (SA) fault model --> modifying the LUT contents in the post place-and-route simulation model.

• Although a processor is a hardware overhead of the GA-based SCDR recovery process, it is not on the critical throughput path.
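One way a stuck-at fault can be expressed as a change to LUT contents is sketched below. The slides do not say whether faults were placed on LUT inputs or outputs, so modelling a stuck input pin is an assumption, as are the function name `inject_stuck_at` and the list-of-bits LUT representation.

```python
def inject_stuck_at(lut, pin, value):
    """Return LUT contents as if input 'pin' were stuck at 'value' (0 or 1).

    lut is a list of 2**k output bits indexed by the input address.
    """
    faulty = []
    for addr in range(len(lut)):
        if value:
            forced = addr | (1 << pin)     # the pin always reads 1
        else:
            forced = addr & ~(1 << pin)    # the pin always reads 0
        faulty.append(lut[forced])
    return faulty

AND2 = [0, 0, 0, 1]     # 2-input AND, address = (b << 1) | a
print(inject_stuck_at(AND2, pin=0, value=1))   # [0, 0, 1, 1]: output follows b
print(inject_stuck_at(AND2, pin=0, value=0))   # [0, 0, 0, 0]: output never rises
```

Rewriting the truth table this way lets a simulation model carry the fault without any change to the netlist structure.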


Simulation Results

Initial population seeded with 50 individuals, each having 25 random RB configurations. Crossover rate = 0.8; size of the evaluation window = 100 input samples.

Figure annotations: a faulty TMR configuration triggers recovery (CED lost); the arrows depict the selected data pathway; a repaired instance in the new configuration.

The consensus fitness history of the population shows the GA converges to a minimum score within 70 generations.

Simulation Results (cont'd)

The output of the preferred recovered TMR arrangement after completing the SCDR fault recovery process shows an improved SNR measure over the faulty circuit.

The amplitude spectrum of the input signal contains two sine-wave components, at 50 Hz and 100 Hz.

The cut-off frequency of the low-pass FIR filter is set to 75 Hz.
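To make the filter setup concrete, the sketch below designs a 25-tap low-pass FIR with a 75 Hz cut-off and checks its gain at the two input tones. The sampling rate of 400 Hz, the Hamming-windowed sinc design, and the reading of "25 stages" as 25 taps are all assumptions; the slides do not state how the coefficients were obtained.

```python
import math

def lowpass_taps(num_taps, cutoff_hz, fs_hz):
    """Hamming-windowed sinc low-pass FIR coefficients with unity DC gain."""
    fc = cutoff_hz / fs_hz                       # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for k in range(num_taps):
        t = k - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]

def gain(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / fs_hz
    re = sum(h * math.cos(w * k) for k, h in enumerate(taps))
    im = sum(h * math.sin(w * k) for k, h in enumerate(taps))
    return math.hypot(re, im)

FS = 400.0                                       # assumed sampling rate in Hz
taps = lowpass_taps(25, 75.0, FS)
print("gain at 50 Hz :", round(gain(taps, 50.0, FS), 3))
print("gain at 100 Hz:", round(gain(taps, 100.0, FS), 3))
```

A healthy filter passes the 50 Hz tone and suppresses the 100 Hz tone, so a fault that disturbs the taps shows up directly as a loss of SNR at the output.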

Comparison: Faults-Aware Simulation Paradigm

• To assess the search performance of the GA, the absolute fitness was assessed with knowledge of the faulty module locations.

• The objective cost to be minimized is the number of faulty blocks used on the data pathway. It is defined in terms of the Fitness State (FS) of each stage: FS = 1 if the stage is faulty, 0 if the stage is healthy.

• The results indicate SCDR can operate autonomously to achieve recovery performance close to that obtained with ideal knowledge of faulty stage isolation.
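A fault-aware cost of this kind can be sketched as a sum of Fitness State values over the stages routed to the active instances. The original equation is not reproduced on the slide, so the choice to sum over two instances, the name `faulty_blocks_on_path`, and the `(stage, module)` fault representation are assumptions.

```python
from itertools import permutations

RB = list(permutations(range(3)))    # 6 routing permutations per stage

def faulty_blocks_on_path(chromosome, faulty, instances=(0, 1)):
    """Sum of Fitness State (1 = faulty, 0 = healthy) over the stages
    routed to the given TMR instances."""
    cost = 0
    for stage, gene in enumerate(chromosome):
        for instance in instances:
            module = RB[gene][instance]
            cost += 1 if (stage, module) in faulty else 0
    return cost

faulty = {(0, 0), (1, 2)}                       # hypothetical fault locations
print(faulty_blocks_on_path([0, 0], faulty))    # 1: stage 0 still uses module 0
print(faulty_blocks_on_path([3, 0], faulty))    # 0: both faults sit on instance 2
```

Unlike the discrepancy-based fitness, this cost needs the fault locations as input, which is why it only serves as a reference for judging the autonomous search.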

Comparison: Exhaustive Search

RO-1 & RO-2: Why consider using a GA?

• As there are 3 instances, each with n variables, the total number of configuration combinations to be evaluated is NE = n^3.

• For an n = 25 stage circuit, the upper bound on the required number of evaluations NE becomes 15,625.

• A population size of 50 was able to reach the correct configuration in 66 generations, thereby necessitating only 3,300 input evaluations.

• Future work will assess and improve the recovery time, which may be significant for mission-critical and safety-critical circuits.
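The figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative, and the n^3 bound is taken from the slide as stated.

```python
n_stages = 25
exhaustive_bound = n_stages ** 3           # NE = n^3 as stated on the slide
population, generations = 50, 66
ga_evaluations = population * generations

print(exhaustive_bound)                                  # 15625
print(ga_evaluations)                                    # 3300
print(round(ga_evaluations / exhaustive_bound, 3))       # 0.211
```

Under these numbers the GA needed roughly a fifth of the evaluations implied by the stated exhaustive bound.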


Conclusions

• SCDR: an adaptive, autonomous, self-healing fault-handling strategy that extends conventional TMR

• Utilizes discrepancy information during normal throughput computation --> no test vectors needed: model-free operation

• FIR filter --> typical application decomposable into distinct pipelined stages

• Recovery of SNR (filter functionality) compared to a fault-exposed circuit --> RO-1: increased sustainability

• Improved recovery time compared to exhaustive search method --> RO-2: increased availability

• Current Work (with M. Lin): development of selectively-fortified SCDR microprocessor softcore

References

[DeMara2005]. R. F. DeMara and K. Zhang, “Autonomous FPGA fault handling through competitive runtime reconfiguration,” in Evolvable Hardware, 2005. Proceedings. 2005 NASA/DoD Conference on, 2005, pp. 109–116.

[DeMara2011]. R. F. DeMara, K. Zhang, and C. A. Sharma, “Autonomic fault-handling and refurbishment using throughput-driven assessment,” Appl. Soft Comput., vol. 11, pp. 1588–1599, March 2011.

[Keymeulen2000]. D. Keymeulen, R. S. Zebulum, Y. Jin, and A. Stoica, “Fault tolerant evolvable hardware using field-programmable transistor arrays,” Reliability, IEEE Transactions on, vol. 49, no. 3, pp. 305–316, 2000.

[Stott2008]. E. Stott, P. Sedcole, and P. Cheung, “Fault tolerant methods for reliability in FPGAs,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, 2008, pp. 415–420.

[Rao2011]. W. Rao, C. Yang, R. Karri, and A. Orailoglu, “Toward future systems with nanoscale devices: Overcoming the reliability challenge,” Computer, vol. 44, no. 2, pp. 46 –53, Feb. 2011.

[Bolchini2010]. C. Bolchini and C. Sandionigi, “Fault classification for SRAM-Based FPGAs in the space environment for fault mitigation,” Embedded Systems Letters, IEEE, vol. 2, no. 4, pp. 107 –110, Dec. 2010.

[Zhang2005]. K. Zhang, R. F. DeMara, and C. A. Sharma, “Consensus-based evaluation for fault isolation and online evolutionary regeneration,” in Proceedings of the International Conference in Evolvable Systems (ICES’05), 2005, pp. 12–24.

[Steiner2009]. N. Steiner and P. Athanas, “Hardware autonomy and space systems,” in Aerospace conference, 2009 IEEE, March 2009, pp. 1 –13.

[Mansouri2010]. I. Mansouri, F. Clermidy, P. Benoit, and L. Torres, “A runtime distributed cooperative approach to optimize power consumption in MPSoCs,” in SOC Conference (SOCC), 2010 IEEE International, Sept. 2010, pp. 25 –30.

[Abramovici2001]. M. Abramovici, J. M. Emmert, and C. E. Stroud, “Roving STARs: an integrated approach to on-line testing, diagnosis, and fault tolerance for FPGAs in adaptive computing systems,” in Evolvable Hardware, 2001. Proceedings. The Third NASA/DoD Workshop on, 2001, pp. 73–92.

[Gericota2008]. M. Gericota, G. Alves, M. Silva, and J. Ferreira, “Reliability and availability in reconfigurable computing: A basis for a common solution,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 11, pp. 1545 –1558, Nov. 2008.


[Dutt2008]. S. Dutt, V. Verma, and V. Suthar, “Built-in-self-test of FPGAs with provable diagnosabilities and high diagnostic coverage with application to online testing,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 27, no. 2, pp. 309 –326, Feb. 2008.

[Carmichael2006]. C. Carmichael, “Triple module redundancy design techniques for Virtex FPGAs,” Xilinx Application Note: Virtex Series, 2006.

[Morgan2007]. K. Morgan, D. McMurtrey, B. Pratt, and M. Wirthlin, “A comparison of TMR with alternative fault-tolerant design techniques for FPGAs,” Nuclear Science, IEEE Transactions on, vol. 54, no. 6, pp. 2065 –2072, Dec. 2007.

[Berrocal2010] A. Jara-Berrocal and A. Gordon-Ross, “VAPRES: a virtual architecture for partially reconfigurable embedded systems,” in Proceedings of the Conference on Design, Automation and Test in Europe, 2010, pp. 837–842.

[Kuehnle2011]. M. Kuehnle, A. Brito, C. Roth, K. Dagas, and J. Becker, “The study of a dynamic reconfiguration manager for systems-on-chip,” Annual Symposium on VLSI, 2011.

[Shannon2005]. L. Shannon, “Leveraging reconfigurability in the design process,” International Conference on Field Programmable Logic and Applications, vol. 0, pp. 731–732, 2005.

[Estrada2010]. S. Lopez-Estrada and R. Cumplido, “Hardware architecture for adaptive filtering based on energy-CFAR processor for radar target detection,” IEICE Electronics Express, vol. 7, no. 9, pp. 628–633, 2010.

[Goldberg1989]. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.

[Baker1985]. J. E. Baker, “Adaptive selection methods for genetic algorithms,” in Proceedings of the 1st International Conference on Genetic Algorithms. Hillsdale, NJ, USA: L. Erlbaum Associates Inc., 1985, pp. 101–111.

[Garvie2004] M. Garvie and A. Thompson, "Scrubbing away transients and jiggling around the permanent: long survival of FPGA systems through evolutionary self-repair," in On-Line Testing Symposium, 2004. IOLTS 2004. Proceedings. 10th IEEE International, 2004, pp. 155-160.

[Al-Haddad2011] R. Al-Haddad, R. Oreifej, R. A. Ashraf, and R. F. DeMara, “Sustainable Modular Adaptive Redundancy Technique Emphasizing Partial Reconfiguration for Reduced Power Consumption,” International Journal of Reconfigurable Computing, vol. 2011.
