Self-Configuring TMR Scheme utilizing Discrepancy Resolution (SCDR)
Naveed Imran and Ronald F. DeMara
Department of Electrical Engineering and Computer Science, University of Central Florida
International Conference on ReConFigurable Computing and FPGAs (ReConFig 2011)
Nov-Dec 2011, Cancun, Mexico
Agenda
1. Related Work
2. SCDR: A Self-Configuring TMR
   − RO-1: How to make TMR adaptive? --> achieve self-healing
   − RO-2: How to perform Evolvable Hardware (EHW) rapidly? --> less reconfiguration overhead
3. Operational Flow of Self-Healing Process
4. Experiment Design and FIR Results
5. Recovery Comparison
6. Conclusions
Need for Autonomous Sustainable Systems
How can a system sustain itself within mission lifetime specifications when operating in failure-prone deployments?
• Harsh Environments --> inevitable hardware failures
  − Aging (TDDB, EM), local permanent damage, etc.
  − Yet scrubbing handles only soft errors
• Mission Critical --> autonomy, safety, financial impact
  − Inaccessible: space, deep-sea, unmanned avionic missions
  − Unforeseen events, limited human intervention, weight/size/power constraints
• Fault Tolerance --> mandatory
  (?) Fault Avoidance. “Always Possible?” … No
  (?) Design Margin. “Always Adequate?” … No
  (?) Modular Redundancy. “Always Recoverable?” … No
  Autonomous Fault Refurbishment. “Highly Flexible?” … Yes! How to achieve?
  − Sustain performance via failed components in alternate modes --> Reconfiguration
  − Develop a design-independent repair technique --> Evolvable Hardware?
  − Refurbish within time constraints --> availability proportional to 1/MTTR
  − Use on-line inputs instead of test vectors
Our Goal: Autonomous FPGA Refurbishment
Increase availability without carrying pre-configured spares …
• Overhead from Unutilized Spares (weight, size, power)
  − Redundancy: increases with amount of spare capacity
  − Refurbishment: weakly-related to number recovery capacity
• Granularity of Fault Coverage (resolution where fault handled)
  − Redundancy: restricted at design-time
  − Refurbishment: variable at recovery-time
• Fault-Resolution Latency (availability via downtime required to handle fault)
  − Redundancy: based on time required to select spare resource
  − Refurbishment: based on time required to find suitable recovery
• Quality of Repair (likelihood and completeness)
  − Redundancy: determined by adequacy of spares available
  − Refurbishment: affected by multiple characteristics (+ or -)
• Autonomous Operation (fix without outside intervention)
  − Redundancy: (?) yes
  − Refurbishment: yes
Redundancy examples: TMR (XTMR, BLTMR). Refurbishment example: EHW.
Related Work
Resource testing techniques
• Online Built-In Self Testing (BIST) method for fault isolation of FPGA resources [Abramovici2001], [Gericota2008], [Dutt2008]
Redundancy-based techniques
• Concurrent Error Detection (CED) arrangement [DeMara2005]
• Triple Modular Redundant (TMR) arrangement --> fault masking in the system output via a majority voter [Carmichael2006], [Morgan2007]
• TMR with Single-Module Recovery [Garvie2004], [Al-Haddad2011]
Evolutionary techniques
• Genetic Algorithms (GAs) have been employed to generate circuits at design-time which are robust to faults [Keymeulen2000]
• Online evolutionary regeneration [Zhang2005]
Permanent Fault Handling
• Permanent faults, especially those due to aging effects in sub-90nm technology, are noted in the literature [Bolchini2010]
• Further, the reduced feature size of future SRAM-based FPGAs may increase their vulnerability to runtime as well as permanent faults, and affect the performance of future nano-scale systems [Rao2011]
SCDR: A Self-Configuring TMR
• Improved fault capacity beyond conventional TMR --> Sustainability
  − Handle multiple simultaneous faults in multiple triplicate modules
• Fitness evaluation using the actual inputs of the system, avoiding any test vectors --> Availability
• Advantage: an explicit fault isolation phase is unnecessary. The faulty modules are autonomously avoided by the evolutionary recovery process.
• Applicability: fully supports pipelined circuits
Reliability
• RO-1: Sustaining performance after multiple failures to improve mission lifetime
• Approach: Self-configuring evolvable hardware using SRAM-based FPGAs
• RO-2: Autonomous Fault Refurbishment: employ static or dynamic reconfiguration capability to provide runtime adaptability
TMR Realization of Pipelined Circuit
• To improve reliability in the presence of multiple faults, which may occur during long mission durations, the datapaths can be adapted.
• The SCDR concept is to avoid the faulty blocks and utilize the healthy blocks in the processing datapath.
• The fault recovery process is initiated if two or more replicated datapaths are fault-affected.
• Upon fault detection in two instances of a TMR arrangement, the system is recovered through reconfiguration by rearranging stages in the two instances.
• The configuration of the Module Switch (MS) elements determines which module is selected as the active element of each instance.
Adaptation of the TMR Pathways by Altering Datapath at Runtime
• A Router Box (RB) is an abstraction of the three MS elements, one for each instance of a TMR pathway. It supports six possible input-output pairings (i.e., 3! = 6).
• An RB can be implemented in hardware either by routing wires through partial reconfiguration or by using multiplexers.
• Using Multiplexers:
  (+) Through the select-line inputs, various configurations of the RB can be explored efficiently
  (-) They remain in the datapath at all times, even after fault recovery
• Using Partial Reconfiguration (PR):
  (+) Reduced latency overhead in datapath
  (-) Reconfiguration time can be considerable
  (-) Bus Macros area overhead
• RO-2: faster EHW --> Higher Availability
  − Exploring search space for rapid recovery with MUX
  − Latency reduced after solution found using PR
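The six input-output pairings of a Router Box can be enumerated directly. The sketch below is illustrative only: the names RB_SETTINGS and route, and the numbering of the settings, are assumptions made here and are not taken from the slides.

```python
from itertools import permutations

# The six possible input-to-output pairings of a 3-way Router Box (3! = 6).
# Entry k of a pairing is the module index routed to TMR instance k.
RB_SETTINGS = list(permutations(range(3)))


def route(setting_index, module_outputs):
    """Return the values seen by instances 0, 1, 2 for a given RB setting."""
    pairing = RB_SETTINGS[setting_index]
    return tuple(module_outputs[m] for m in pairing)


if __name__ == "__main__":
    # Print every setting applied to three labelled module outputs.
    for index, pairing in enumerate(RB_SETTINGS):
        print(index, pairing, route(index, ("A", "B", "C")))
```

In a multiplexer-based RB, the setting index would correspond to the select-line value; with PR it would correspond to one of six pre-routed partial bitstreams.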
Router Box
Encoding Representation of the TMR Pathways
• Genetic Algorithms:
  − Implement guided trial-and-error search using principles of biological evolution
  − Iterative selection enforces “survival of the fittest”
• Evolutionary fault recovery process explores the search space of RB settings to obtain a more suitable TMR realization
• Number of variables in a GA individual = number of pipeline stages (or number of blocks), n, in an instance of the TMR arrangement
• Chromosome length = n. Each encoding field = 1 of 6 routing permutations.
Example: one possible configuration of the TMR system (an individual of the population)
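The encoding can be sketched as a list of n integers, each selecting one of the six routing permutations for one stage. The helper names random_chromosome and decode, and the convention that entry k of a permutation names the module copy used by instance k, are assumptions for illustration.

```python
import random
from itertools import permutations

RB_SETTINGS = list(permutations(range(3)))  # 6 routing permutations per stage


def random_chromosome(n_stages, rng):
    """One GA individual: n genes, each selecting 1 of 6 RB settings."""
    return [rng.randrange(len(RB_SETTINGS)) for _ in range(n_stages)]


def decode(chromosome):
    """Map a chromosome to three pathways.

    pathways[i][s] is the physical module copy (0, 1 or 2) that TMR
    instance i uses at pipeline stage s.
    """
    pathways = [[], [], []]
    for gene in chromosome:
        pairing = RB_SETTINGS[gene]
        for instance in range(3):
            pathways[instance].append(pairing[instance])
    return pathways


if __name__ == "__main__":
    individual = random_chromosome(25, random.Random(0))
    for instance, path in enumerate(decode(individual)):
        print("instance", instance, path)
```

Because each gene is a permutation, every physical module at a stage is used by exactly one instance, so no setting leaves a module shared or unassigned.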
Operational Flow of Evolutionary Recovery Process
1. Objective for EHW recovery is specified
   − Realize an operational CED arrangement out of the TMR arrangement
   − The Fitness Function is defined in terms of the output discrepancy of the CED pair
2. Population of alternative designs is created
   − Initially at random; each individual of the population corresponds to one complete configuration of the TMR
3. Genetic Algorithm invoked to evolve each alternative
   − Fitness evaluated for alternatives using reconfiguration of RB connections
   − Genetic Operators used to increase fitness
4. Fitness Exit Criteria checked
   − If Fitness Score != 0 then repeat Step 3
5. Best design represents desired TMR configuration
The recovery process in the context of a standard GA [Goldberg1989]
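The five steps above can be sketched as a small generational loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the fitness callback stands in for the discrepancy measured on hardware, and the keep-the-better-half selection, single-point crossover, and mutation rate are choices made here. Only the population size of 50 and the crossover rate of 0.8 appear elsewhere in the slides.

```python
import random

N_SETTINGS = 6  # routing permutations per Router Box


def evolve(fitness, n_stages, pop_size=50, crossover_rate=0.8,
           mutation_rate=0.05, max_generations=200, seed=0):
    """Minimise `fitness` over chromosomes of RB settings; 0 means recovered."""
    rng = random.Random(seed)
    # Step 2: random initial population, one complete TMR configuration each.
    population = [[rng.randrange(N_SETTINGS) for _ in range(n_stages)]
                  for _ in range(pop_size)]
    for generation in range(max_generations):
        # Step 3a: evaluate and rank every individual (lower is better).
        ranked = sorted(population, key=fitness)
        # Step 4: exit once the best individual shows no discrepancy.
        if fitness(ranked[0]) == 0:
            return ranked[0], generation
        # Step 3b: keep the better half as parents, then breed replacements.
        parents = ranked[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = list(a)
            if n_stages > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_stages)  # single-point crossover
                child = a[:cut] + b[cut:]
            # Mutation: occasionally replace a gene with a random RB setting.
            child = [rng.randrange(N_SETTINGS) if rng.random() < mutation_rate
                     else gene for gene in child]
            children.append(child)
        population = parents + children
    # Step 5: best design found within the generation budget.
    return min(population, key=fitness), max_generations


if __name__ == "__main__":
    def toy_fitness(chromosome):
        # Hypothetical stand-in for the measured discrepancy.
        return sum(gene != 0 for gene in chromosome)

    best, generations = evolve(toy_fitness, n_stages=6)
    print(best, generations)
```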
Fitness Function
• Given many possible configurations of the system, the objective is to assess which is superior in terms of the desired functionality of the system.
• The approach of storing the input-output test vectors or truth-tables of the circuit to assess its fitness is avoided: it is not tractable for large-scale circuits.
• A Discrepancy Value (DV) is defined as the Euclidean distance between the outputs of two instances of a TMR in a given evaluation window (in units of inputs applied). The objective is to minimize the DV between Y1 and Y2.
[Figure: Input applied to Instance1, Instance2, and Instance3; outputs Y1 and Y2]
Assessment of the fitness based upon relative behavior instead of absolute fitness knowledge
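The Discrepancy Value described on this slide can be computed directly from two output windows. The function name discrepancy_value and the use of plain Python lists for the sampled outputs are assumptions for illustration.

```python
import math


def discrepancy_value(y1, y2):
    """Euclidean distance between two instances' outputs over one window."""
    if len(y1) != len(y2):
        raise ValueError("both output windows must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))


if __name__ == "__main__":
    print(discrepancy_value([1, 2, 3], [1, 2, 3]))  # 0.0: the pair agrees
    print(discrepancy_value([0, 0], [3, 4]))        # 5.0
```

A DV of zero over the window means the two instances agree on every applied input, which is the relative behavior the fitness assessment relies on.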
Fitness Evaluation & Selection
• The input is applied to the TMR arrangement and the output of the individual instances is observed during the evaluation window period.
• Upon the completion of the evaluation window, a new configuration of the TMR is introduced into operation.
• Once all the individuals have been evaluated, a generation of the GA is complete.
• Fitness scores of the individuals, which are based upon their agreement/disagreement history, become available at the end of a generation.
• Which configurations should be retained for subsequent operations? Rank selection [Baker1985] --> configurations having small DV are retained to produce new configurations in the next generation
• Genetic Operators --> crossover and mutation
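Rank selection can be sketched with linear ranking, one common form of rank-based selection. The slides do not say which variant was used, so the pressure parameter and the function names below are assumptions.

```python
import random


def linear_rank_weights(dv_scores, pressure=1.5):
    """Selection weight per individual: smaller DV gives a larger weight.

    `pressure` (1.0 to 2.0) is the weight of the best individual; the worst
    gets 2 - pressure, and the weights average to 1.
    """
    size = len(dv_scores)
    # Order indices from worst (largest DV) to best (smallest DV).
    order = sorted(range(size), key=lambda i: dv_scores[i], reverse=True)
    weights = [0.0] * size
    for rank, index in enumerate(order):
        share = rank / (size - 1) if size > 1 else 1.0
        weights[index] = (2.0 - pressure) + 2.0 * (pressure - 1.0) * share
    return weights


def rank_select(population, dv_scores, n_parents, rng, pressure=1.5):
    """Draw parents with probability proportional to their rank weight."""
    weights = linear_rank_weights(dv_scores, pressure)
    return rng.choices(population, weights=weights, k=n_parents)


if __name__ == "__main__":
    configs = ["cfg_a", "cfg_b", "cfg_c"]
    print(linear_rank_weights([0.0, 5.0, 1.0]))  # [1.5, 0.5, 1.0]
    print(rank_select(configs, [0.0, 5.0, 1.0], 4, random.Random(1)))
```

Ranking uses only the ordering of DV scores, so one configuration with a very large discrepancy does not distort the selection probabilities of the others.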
Experiment Design
• A 25-stage FIR filter implemented in Verilog HDL using the Xilinx ISE 9.2i development tool.
• An ML410 development board --> Virtex-4 FPGA chip, Compact Flash interface, DDRAM, and UART.
• The circuit is triplicated to realize a TMR system.
• Logic resources utilized by one instance of the TMR arrangement: 2810 LUTs and 500 FFs.
• Hard multipliers are avoided during synthesis to simulate a generic design.
• Random Stuck-At (SA) fault model --> modifying the LUT contents in the post place-and-route simulation model.
• Although a processor is a hardware overhead of the GA-based SCDR recovery process, it is not on the critical throughput path.
Simulation Results
Initial population seeded with 50 individuals, each having 25 random RB configurations. Crossover rate = 0.8; size of the evaluation window = 100 input samples.
The arrows depict the selected data pathway.
A repaired instance in the new configuration.
The consensus fitness history of the population shows the GA converges to a minimum score within 70 generations.
A faulty TMR configuration triggers recovery (CED lost)
Simulation Results (con’t)
The output of the preferred recovered TMR arrangement after completing the SCDR fault recovery process shows improved SNR measure over the faulty circuit.
The amplitude spectrum of the input signal contains two periodic sine waves of frequency 50 Hz and 100 Hz.
The cut-off frequency of the low-pass FIR filter is set to 75 Hz.
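The filter settings on this slide can be illustrated with a small windowed-sinc design. This is only a sketch: the 400 Hz sampling rate, the Hamming window, and the function names lowpass_taps and gain_at are assumptions, not details of the experiment. Only the 25 taps, the 75 Hz cut-off, and the 50 Hz and 100 Hz tones come from the slides.

```python
import math


def lowpass_taps(n_taps, cutoff_hz, fs_hz):
    """Hamming-windowed sinc low-pass coefficients, normalised to unit DC gain."""
    fc = cutoff_hz / fs_hz  # cut-off in cycles per sample
    mid = (n_taps - 1) / 2
    taps = []
    for k in range(n_taps):
        x = k - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (n_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def gain_at(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at one frequency."""
    w = 2 * math.pi * freq_hz / fs_hz
    re = sum(t * math.cos(w * k) for k, t in enumerate(taps))
    im = sum(t * math.sin(w * k) for k, t in enumerate(taps))
    return math.hypot(re, im)


if __name__ == "__main__":
    FS_HZ = 400.0  # assumed sampling rate, not given in the slides
    taps = lowpass_taps(25, 75.0, FS_HZ)
    print("gain at 50 Hz :", gain_at(taps, 50.0, FS_HZ))
    print("gain at 100 Hz:", gain_at(taps, 100.0, FS_HZ))
```

A healthy filter of this kind passes the 50 Hz tone and attenuates the 100 Hz tone, which is why a fault in any stage shows up as a drop in output SNR.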
Comparison: Faults-Aware Simulation Paradigm
• To assess the search performance of the GA, the absolute fitness was assessed given knowledge of the faulty modules' locations.
• The objective cost to be minimized is the number of faulty blocks utilized on the data pathway, defined in terms of the Fitness State (FS) of each stage.
• The results indicate SCDR can operate autonomously to achieve recovery performance close to ideal knowledge of faulty stage isolation.
Fitness State (FS) = 1 if the stage is faulty, 0 if the stage is healthy
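The fault-aware cost can be sketched as a sum of Fitness States along the evaluated pathways. The exact expression from the slide did not survive in this transcript, so the summation below, the choice of instances 0 and 1 as the evaluated pair, and the names fault_aware_cost and faulty are assumptions.

```python
from itertools import permutations

RB_SETTINGS = list(permutations(range(3)))  # 6 routing permutations per stage


def fault_aware_cost(chromosome, faulty, instances=(0, 1)):
    """Sum of Fitness States over the chosen instances' pathways.

    faulty[s][m] is True when physical module m at stage s is faulty,
    so each visited block contributes FS = 1 if faulty, 0 if healthy.
    """
    cost = 0
    for stage, gene in enumerate(chromosome):
        pairing = RB_SETTINGS[gene]
        for instance in instances:
            cost += int(faulty[stage][pairing[instance]])
    return cost


if __name__ == "__main__":
    # Hypothetical two-stage fault map: module 0 faulty at stage 0,
    # module 1 faulty at stage 1.
    fault_map = [[True, False, False], [False, True, False]]
    print(fault_aware_cost([0, 0], fault_map))  # 2: both faults on the pair
    print(fault_aware_cost([5, 1], fault_map))  # 0: faults routed to instance 2
```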
Comparison: Exhaustive Search
RO-1 & RO-2: Why consider using a GA?
• As there are 3 instances, each with n variables, the total number of configuration combinations to be evaluated is NE = n^3.
• For an n = 25 stage circuit, the upper bound on the required number of evaluations NE becomes 15,625.
• A population size of 50 reached the correct configuration in 66 generations, thereby requiring only 3,300 input evaluations.
• Future work will assess and improve recovery time, which may be significant for mission-critical and safety-critical circuits.
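The comparison uses simple arithmetic that can be checked directly. The variable names are illustrative, and the bound follows the n^3 expression as stated on the slide.

```python
n_stages = 25          # pipeline stages in the FIR filter
population_size = 50   # GA individuals per generation
generations = 66       # generations until the correct configuration

exhaustive_bound = n_stages ** 3                 # bound stated on the slide
ga_evaluations = population_size * generations   # one evaluation per individual

print(exhaustive_bound)                               # 15625
print(ga_evaluations)                                 # 3300
print(round(exhaustive_bound / ga_evaluations, 2))    # 4.73
```

Under these numbers the GA needs roughly one fifth of the evaluations of the stated exhaustive bound.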
Conclusions
• SCDR: an adaptive, autonomous, self-healing fault-handling strategy that extends conventional TMR
• Utilizes discrepancy information during normal throughput computation --> no test vectors needed: model-free operation
• FIR filter: a typical application decomposable into distinct pipelined stages
• Recovery of SNR (filter functionality) compared to a fault-exposed circuit --> RO-1: increased sustainability
• Improved recovery time compared to exhaustive search method --> RO-2: increased availability
• Current Work (with M. Lin): development of selectively-fortified SCDR microprocessor softcore
References
[DeMara2005]. R. F. DeMara and K. Zhang, “Autonomous FPGA fault handling through competitive runtime reconfiguration,” in Evolvable Hardware, 2005. Proceedings. 2005 NASA/DoD Conference on, 29 2005, pp. 109 – 116.
[DeMara2011]. R. F. DeMara, K. Zhang, and C. A. Sharma, “Autonomic fault-handling and refurbishment using throughput-driven assessment,” Appl. Soft Comput., vol. 11, pp. 1588–1599, March 2011.
[Keymeulen2000]. D. Keymeulen, R. S. Zebulum, Y. Jin, and A. Stoica, “Fault tolerant evolvable hardware using field-programmable transistor arrays,” Reliability, IEEE Transactions on, vol. 49, no. 3, pp. 305–316, 2000.
[Stott2008]. E. Stott, P. Sedcole, and P. Cheung, “Fault tolerant methods for reliability in FPGAs,” in Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, 8-10 2008, pp. 415 –420.
[Rao2011]. W. Rao, C. Yang, R. Karri, and A. Orailoglu, “Toward future systems with nanoscale devices: Overcoming the reliability challenge,” Computer, vol. 44, no. 2, pp. 46 –53, Feb. 2011.
[Bolchini2010]. C. Bolchini and C. Sandionigi, “Fault classification for SRAM-Based FPGAs in the space environment for fault mitigation,” Embedded Systems Letters, IEEE, vol. 2, no. 4, pp. 107 –110, Dec. 2010.
[Zhang2005]. K. Zhang, R. F. DeMara, and C. A. Sharma, “Consensus-based evaluation for fault isolation and online evolutionary regeneration,” in Proceedings of the International Conference in Evolvable Systems (ICES’05), 2005, pp. 12–24.
[Steiner2009]. N. Steiner and P. Athanas, “Hardware autonomy and space systems,” in Aerospace conference, 2009 IEEE, March 2009, pp. 1 –13.
[Mansouri2010]. I. Mansouri, F. Clermidy, P. Benoit, and L. Torres, “A runtime distributed cooperative approach to optimize power consumption in MPSoCs,” in SOC Conference (SOCC), 2010 IEEE International, Sept. 2010, pp. 25 –30.
[Abramovici2001]. M. Abramovici, J. M. Emmert, and C. E. Stroud, “Roving STARs: an integrated approach to on-line testing, diagnosis, and fault tolerance for FPGAs in adaptive computing systems,” in Evolvable Hardware, 2001. Proceedings. The Third NASA/DoD Workshop on, 2001, pp. 73–92.
[Gericota2008]. M. Gericota, G. Alves, M. Silva, and J. Ferreira, “Reliability and availability in reconfigurable computing: A basis for a common solution,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 11, pp. 1545 –1558, Nov. 2008.
References (con’t)
[Dutt2008]. S. Dutt, V. Verma, and V. Suthar, “Built-in-self-test of FPGAs with provable diagnosabilities and high diagnostic coverage with application to online testing,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 27, no. 2, pp. 309 –326, Feb. 2008.
[Carmichael2006]. C. Carmichael, “Triple module redundancy design techniques for Virtex FPGAs,” Xilinx Application Note: Virtex Series, 2006.
[Morgan2007]. K. Morgan, D. McMurtrey, B. Pratt, and M. Wirthlin, “A comparison of TMR with alternative fault-tolerant design techniques for FPGAs,” Nuclear Science, IEEE Transactions on, vol. 54, no. 6, pp. 2065 –2072, Dec. 2007.
[Berrocal2010]. A. Jara-Berrocal and A. Gordon-Ross, “VAPRES: a virtual architecture for partially reconfigurable embedded systems,” in Proceedings of the Conference on Design, Automation and Test in Europe, 2010, pp. 837–842.
[Kuehnle2011]. M. Kuehnle, A. Brito, C. Roth, K. Dagas, and J. Becker, “The study of a dynamic reconfiguration manager for systems-on-chip,” Annual Symposium on VLSI, 2011.
[Shannon2005]. L. Shannon, “Leveraging reconfigurability in the design process,” International Conference on Field Programmable Logic and Applications, vol. 0, pp. 731–732, 2005.
[Estrada2010]. S. Lopez-Estrada and R. Cumplido, “Hardware architecture for adaptive filtering based on energy-CFAR processor for radar target detection,” IEICE Electronics Express, vol. 7, no. 9, pp. 628–633, 2010.
[Goldberg1989]. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[Baker1985]. J. E. Baker, “Adaptive selection methods for genetic algorithms,” in Proceedings of the 1st International Conference on Genetic Algorithms. Hillsdale, NJ, USA: L. Erlbaum Associates Inc., 1985, pp. 101–111.
[Garvie2004]. M. Garvie and A. Thompson, “Scrubbing away transients and jiggling around the permanent: long survival of FPGA systems through evolutionary self-repair,” in On-Line Testing Symposium, 2004. IOLTS 2004. Proceedings. 10th IEEE International, 2004, pp. 155–160.
[Al-Haddad2011]. R. Al-Haddad, R. Oreifej, R. A. Ashraf, and R. F. DeMara, “Sustainable Modular Adaptive Redundancy Technique Emphasizing Partial Reconfiguration for Reduced Power Consumption,” International Journal of Reconfigurable Computing, vol. 2011.