1 czajkowskimapld 2005/139 sefi mitigation technique for cots microprocessors – proton testing...

23
1 Czajkowsk i MAPLD 2005/139 SEFI Mitigation Technique for COTS Microprocessors – Proton Testing Demonstration D. Czajkowski, P. Samudrala, D. Strobel, and M. Pagey Space Micro Inc., Suite 400 San Diego, CA 92121

Upload: william-newman

Post on 26-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

1Czajkowski MAPLD 2005/139

SEFI Mitigation Technique for COTSMicroprocessors – Proton Testing

Demonstration

D. Czajkowski, P. Samudrala, D. Strobel, and M. Pagey

Space Micro Inc., Suite 400

San Diego, CA 92121

2Czajkowski MAPLD 2005/139

➢ Single Event Functional Interrupts (SEFI) introduction ➢ H-Core chip for SEFI mitigation in microprocessors➢ Tests/Results for Pentium P-III processor➢ Tests/Results for TI-DSP Processor➢ Tests/Results for BSP Processor➢ Summary

Overview

3Czajkowski MAPLD 2005/139

➢ EIA/JEDEC Standard No. 57 (1996): “ The loss of functionality of the device that does not require cycling of the devices power to restore operability unlike SEL and does not result in permanent damage as in SEB”➢ SEFIs observed in various complex integrated circuits: EEPROMs, DRAMs, ADC/DACs, and Microprocessors.➢ Most common solution for SEFIs is to power cycle the system: “Even single bit-flips can create circuit-level effects that cause string of errors, or even result in a “lock-up” condition that requires removal of power and subsequent re-initialization to resume proper operation”

Single Event Functional Interrupts

4Czajkowski MAPLD 2005/139

➢ SEFI Characteristics✔ Processor Hangs Suddenly➢ Probable causes of “Hangs”✔ Illegal branching✔ Upsets in program counter of the CPU✔ Jumps to undefined/test states➢ Approx. rates : 1 per 100 days for SOI Power PC and 1per 10 for CMOS version➢ Current solution to power cycle the system➢ Results in unnecessary delays and data loss

SEFIs in Microprocessors

5Czajkowski MAPLD 2005/139

➢ H-Core✔ Combination of Software and Hardware ✔ Monitors CPU Functionality✔ Stores rollback information✔ Detects and indicates SEFI occurrences ✔ Revives CPU from SEFI events➢ H-Core

✔ Sends CPU alive messages✔ Saves periodic roll-back information✔ Reads SEFI indicator from H-Core chip, and✔ Recovers running processes after SEFI events

H-Core Technique

CPU

BusController

Memory

Ethernet

SCSIHBA

H-CoreChip

➢ H-Core✔ Combination of Software and Hardware ✔ Monitors CPU Functionality✔ Stores rollback information✔ Detects and indicates SEFI occurrences ✔ Revives CPU from SEFI events➢ H-Core

✔ Sends CPU alive messages✔ Saves periodic roll-back information✔ Reads SEFI indicator from H-Core chip, and✔ Recovers running processes after SEFI events

H-Core Technique

CPU

BusController

Memory

Ethernet

SCSIHBA

H-CoreChip

6Czajkowski MAPLD 2005/139

The H-Core Chip

➢ Manufactured using rad-hard components➢ Usable with any processor➢ Provides min. 8 interrupt signals➢ Uses MOSFET driver for power cycle➢ Provides variable levels and pulse widths of interrupts➢ Contains programmable CPU check timer➢ Sets SEFI status signal for SEFI recovery software➢ Provides external reset control

7Czajkowski MAPLD 2005/139

Radiation Tests➢ Radiation Tests on three different processors were performed at Crocker Nuclear Laboratory at UC, Davis

➢ Processors used✔ Pentium P-III✔ TI TMS320C6713 DSP✔ Equator BSP-15 DSP

➢ Each processor was bombarded with radiation to induce a SEFI

➢ H-Core circuit was then used in mitigating the induced SEFIs

➢ The following slides discuss the tests

8Czajkowski MAPLD 2005/139

Irradiation Test Procedure

➢ Verify test loop results without radiation

➢ Start irradiation and monitor test loop results➢ Stop irradiation when incorrect/no test loop results received

➢ Assert H-Core signals in sequence to revive the processor

➢ CPU assumed to be fully recovered when it responds to a signal

9Czajkowski MAPLD 2005/139

Radiation test of Intel PIII

➢ SEFI test performed by irradiating Pentium P-III processor

➢ H-Core circuit was used in bringing back the processor from SEFI

➢ Test performed at U.C, Davis, California

➢ Used a test program containing an infinite loop of arithmetic operations

10Czajkowski MAPLD 2005/139

Intel PIII Experimental SetupVSBC-8d Test Computer

Serial Serial ConsoleConsole

Monitor Computer

RS-232

Ethernet

H-Core Signals➢Software/Hardware

include:✔ SEFI board and test loop

✔ Diagnostic self-tests✔ Hardware watchdog✔ Linux software watchdog

➢Controls test loops,

➢Collects test results, and

➢Sends H-Core signals

11Czajkowski MAPLD 2005/139

Test and Monitor Software➢ VSBC-8d runs Linux OS

➢ Test loops:✔Mathematical functions test✔CPU timer test✔Network communication test✔IDE controller test

➢Monitor software:✔Serial console and telnet✔Socket communication with test loops✔Data logging during irradiation✔H-Core signal generation after SEFI

Monitor Monitor softwaresoftware

H-Core signal H-Core signal controlscontrols

OutputOutput

12Czajkowski MAPLD 2005/139

H-Core Signals for Intel P-III

➢ BINIT# Bus state machine reset

➢ INIT# Resets integer registers

➢ LINT0 General purpose interrupt signal.

➢ IRQ5 Hardware interrupt through PCI bus.

➢ LINT1 Non-maskable interrupt (NMI).

➢ RESET# Intel PIII hardware reset signal.

➢ Software, hardware, and APIC watchdogs.

13Czajkowski MAPLD 2005/139

H-Core Signal Success Rate

➢ SEFI Occurrences and Recovery

✔ 21 SEFIs detected during experiment✔ 21 SEFIs recovered using H-Core signals✔ IRQ5, NMI, and RESET# most effective signals✔ Presence of software, hardware, APIC watchdog aids recovery

14Czajkowski MAPLD 2005/139

TMS320C6713 Radiation TestMarch 2004

➢ SEFI test performed by irradiating Texas Instrument's TMS320C6713 DSP

➢ A custom board called SEFI Switch was built

➢ SEFI switch with controlling software (running on monitor computer) was used as a H-Core chip for SEFI mitigation

➢ Test performed at U.C, Davis, California

➢ Used a test program containing an infinite loop of arithmetic operations

15Czajkowski MAPLD 2005/139

USB Communicati

on

TMS320C6713 DSP Monitor Computer

➢ Development Board➢ Runs TTMR Test Loops➢ Communicates using USB-JTAG link

➢ Windows 2000 PC➢ Monitors and controls the TI-DSP board

TI-DSP Experimental Setup

16Czajkowski MAPLD 2005/139

➢CodeComposer allows remote monitoring of processor

➢All processor registers can be observed during irradiation

➢Test loop results are transmitted back to Monitor computer

➢ If the processor hangs, interrupts and reset are asserted

➢ List of Interrupts for TI processor✔ HD4✔ INT7✔ INT6✔ INT5✔ INT4✔ NMI

Monitoring TI-DSP Execution

17Czajkowski MAPLD 2005/139

Signatures of SEFI

Program counter at unexpected memory location

➢Typical SEFI signatures observed during radiation experiment:✔ Jumps to arbitrary memory locations containing random data✔ Execution of valid instructions that are not part of current program

18Czajkowski MAPLD 2005/139

➢ A total of 9 SEFIs were observed➢ Interrupts and Reset able to bring back the processor in 7 cases

➢ Observed a success rate of 77.7 % in mitigating SEFIs

➢ Reset was the most effective signal

TI-DSP Radiation Test : Results

19Czajkowski MAPLD 2005/139

H-Core Signals

BSP-15 Radiation TestAugust 2004

➢ Processor : Equator BSP-15 DSP processor

➢ Estimating the performance of H-Core circuit by bombarding BSP-15 processor with different fluxes of radiation

➢ Test Facility : U.C, Davis, California

➢ Input Program : a set of arithmetic operations running infinitely

20Czajkowski MAPLD 2005/139

BSP-15 Experimental Setup

RS-232

Ethernet

H-Core Signals

SEFI Switch Test Board

Monitor Computer

Serial Port

Ethernet Port

21Czajkowski MAPLD 2005/139

BSP-15 SEFI Test

➢ SEFI switch and the controlling program running on the Monitor computer acts as a H-Core chip

➢ Processor bombarded with radiation until it hangs➢ The interrupts viz. INTD, INTB, INTA and RESET signal are asserted after detecting a SEFI

➢ Customized interrupts service routines are called when an interrupt is pulled

22Czajkowski MAPLD 2005/139

BSP-15 Test Results

➢ The processor has encountered 26 SEFIs during the test

➢ The interrupts were asserted in the increasing order of their severity

➢ Interrupts had a success rate of 11.5% in mitigating the SEFIs

➢ However, reset had 100% success rate➢ In all the cases the processor was revived without powering cycling the board

23Czajkowski MAPLD 2005/139

✔Anatomy of SEFI illustrated using proton irradiation of Pentium P-III, TI-DSP, and Equator BSP – 15 microprocessors✔Demonstrated the effectiveness of H-Core circuit in SEFI mitigation✔Processors recovered from all detected SEFIs using H-Core signals without requiring power cycle ✔Tests indicate that H-Core has 100 % success rate in mitigating SEFIs without powering down the board

Conclusions