
NSREC’07 Short course Fernanda Lima Kastensmidt

2007 IEEE NSREC Short Course

Section II

SEE Mitigation Strategies for Digital Circuit Design Applicable to ASIC and FPGAs

Fernanda Lima Kastensmidt, Universidade Federal do Rio Grande do Sul (UFRGS)

SEE Mitigation Strategies for Digital Circuit Design Applicable to ASIC and FPGAs

Fernanda Lima Kastensmidt

Computer Science Department - PPGC

Universidade Federal do Rio Grande do Sul (UFRGS) Porto Alegre – RS – Brazil

[email protected]

Table of Contents

1. Radiation Effects on Digital ICs
1.1 Charge Collection Mechanism in MOS devices
1.2 Single Event Effects in Digital ICs
2. Radiation Hardening by Design: Strategies for ASICs
2.1 Layout- and Electrical-level based techniques
2.1.1 Bulk Built-in Current Sensors
2.1.2 Transistor Resizing for Charge Dissipation
2.2 Logic-level based techniques
2.2.1 Hardware redundancy techniques
2.2.2 Time redundancy techniques
2.2.3 Mixed Hardware and Time Redundancy Techniques
2.2.3 Hardened Memory Cells
2.2.4 Error Correcting Code (ECC)
2.3 Architectural level based techniques
2.4 Area and Performance Tradeoffs Summary
3. Radiation Effects on FPGAs
3.1 Antifuse-based FPGAs
3.2 SRAM-based FPGAs
4. Radiation Hardening by Design: Strategies for SRAM-based FPGAs
4.1 Scrubbing
4.2 Triple Modular Redundancy
4.3 Duplication with Comparison with Concurrent Error Detection
4.4 Placement and Routing Issues
4.4.1 Solutions based on Placement and Routing
4.4.2 Solutions based on Voting Adjustments
4.5 Partial Triple Modular Redundancy
5. Final Remarks
References

1. Radiation Effects on Digital ICs

Fault-tolerance is defined as a set of techniques to provide a service capable of

fulfilling the system function in spite of (a limited number of) faults. Fault-tolerance on

semiconductor devices has been meaningful since upsets were first experienced in space

applications several years ago. Since then, the interest in studying fault-tolerant

techniques in order to keep integrated circuits (ICs) operational in such hostile

environments has increased, driven by the many applications of radiation-tolerant

circuits, such as space missions, satellites, high-energy physics experiments and others.

Spacecraft systems include a large variety of analog and digital components that are

potentially sensitive to radiation and therefore fault-tolerant techniques must be used to

ensure reliability.

In addition, because of the continuous evolution of the fabrication technology

process of semiconductor components, in terms of transistor geometry shrinking, power

supply, speed, and logic density, as presented in the International Technology Roadmap for

Semiconductors (ITRS) [72], fault-tolerance has become a matter of concern for

circuits operating at ground level as well. As stated in [32], [36], [59], [24], [21], [26] and

[25], drastic device shrinking, power supply reduction, and increasing operating speeds

significantly reduce the noise margins and thus increase the threats that very deep

submicron (VDSM) ICs face from the various internal sources of noise. This process is

now approaching a point where it will be unfeasible to produce ICs that are free from

these effects. Consequently, fault-tolerance is no longer a concern exclusively for space

designers but also for designers of next generation products, which must cope with errors

at ground level due to the advanced technology.

1.1 Charge Collection Mechanism in MOS devices

The radiation environment is composed of various particles generated by solar

activity, as presented by [7]. The particles can be classified into two major types: (1)

energetic particles such as electrons, protons and heavy ions, and (2) electromagnetic

radiation (photons), which can be x-ray, gamma ray, or ultraviolet light. The main

sources of energetic particles that contribute to radiation effects are protons and electrons

trapped in the Van Allen belts, heavy ions trapped in the magnetosphere, galactic cosmic

rays and solar flares. The charged particles interact with the silicon atoms causing

excitation and ionization of atomic electrons.

At the ground level, the neutrons are the most frequent cause of upset as shown by

[57, 58]. Neutrons are created by cosmic ion interactions with the oxygen and nitrogen in

the upper atmosphere. The neutron flux is strongly dependent on key parameters such as

altitude, latitude and longitude. High-energy neutrons interact with the

material, generating free electron-hole pairs, whereas low-energy neutrons

interact with a certain type of boron present in the semiconductor material, creating other

particles, as shown by [9]. Alpha particles are secondary particles emitted by

radioactive impurities present in the device itself or in the packaging

materials, and they are the greatest concern. Materials are selected to minimize the emission of

alpha particles; however, this does not eliminate the problem completely.

As an energetic particle traverses the material of interest, it deposits energy along

its path, as shown in figure 1-1. This energy is measured as a linear energy transfer

(LET), which is defined as the amount of energy deposited per unit of distance traveled,

normalized to the material's density. It is usually expressed in MeV-cm²/mg. The ionized

track contains equal numbers of electrons and holes. The total number of charges is

proportional to the LET of the incoming particle.

Figure 1-1. Silicon substrate ionization due to an energetic particle hit

The sensitive sites are the surroundings of the reverse-biased drain junctions of a

transistor biased in the off state, as explained by [22]. If an energetic particle passes

through the pn-junction of a CMOS transistor in the off state, a short is momentarily

created between the substrate and the struck drain terminal. The amount of charge that is

collected produces a transient current pulse that lasts until the deposited charge

disappears by recombination or is conducted away via open current paths to VDD or

ground, returning the logic node to its original state.

Figure 1-2 shows charge collection occurring at the drain junction of the p-channel

transistor. Originally the node held the value ‘0’. As current flows through the

pn-junction of the struck transistor, from the bulk (connected to VDD) to the drain, the

transistor in the on-state (n-channel transistor in figure 1-2) conducts a current that

attempts to balance the current induced by the particle strike. If the collected charge

induced by the particle strike is high enough that the on-transistor cannot balance the

current before the node capacitance is charged, a voltage change at the node will occur.

This voltage change lasts until the charge is conducted away by the current fed through

the on-transistor.

Figure 1-2. Charge Collection Mechanism in inverter gate (labels: input 1, Vout 0, off transistor, on transistor, transient current, transient voltage pulse)

The collected charge (Qc) depends on the energy and ion type,

as well as the path length over which the charge is collected. It correlates with the

linear energy transfer (LET) value of the energetic particle, as shown:

$$Q_c = \frac{L_{th}\, T\, d\, e}{X} \qquad (1)$$

where Qc is the collected charge, Lth is the threshold effective LET (in

MeV-cm²/mg), T is the device thickness (in microns), d is the material density (2.32

g/cm³ for Si), e is the electronic charge (1.602 × 10⁻⁷ pC), and X is the energy needed to

create one electron-hole pair (3.6 eV in Si). Replacing d, e and X in equation (1)

gives:

$$Q_c = 1.03 \times 10^{-2}\,(L_{th}\, T)\ \text{pC for Si} \qquad (2)$$

Considering T of about 1 µm as a reasonable order of magnitude for conventional logic

circuits and LET from 5 to 40 MeV-cm²/mg, the critical charge values obtained by

equation (2) range from 50 fC to 410 fC. These numbers agree with those published by [23]: in

silicon, an LET of 97 MeV-cm²/mg corresponds to a charge deposition of 1 pC/µm.
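As a quick numerical check of equations (1) and (2), the short Python sketch below recomputes the collected charge from the constants quoted in the text. The function and constant names are illustrative choices and are not part of the original material; the only physics used is the unit conversion spelled out in the comments.

```python
# Constants quoted in the text for silicon.
SI_DENSITY_G_PER_CM3 = 2.32      # d
ELECTRON_CHARGE_PC = 1.602e-7    # e, expressed in pC
PAIR_ENERGY_EV = 3.6             # X, energy per electron-hole pair


def collected_charge_pc(let_mev_cm2_per_mg: float, thickness_um: float) -> float:
    """Equation (1) with explicit unit conversions; returns the charge in pC."""
    density_mg_per_cm3 = SI_DENSITY_G_PER_CM3 * 1e3   # g/cm^3 -> mg/cm^3
    thickness_cm = thickness_um * 1e-4                # um -> cm
    # LET * density * path length gives the deposited energy in MeV; convert to eV.
    deposited_ev = let_mev_cm2_per_mg * density_mg_per_cm3 * thickness_cm * 1e6
    pairs = deposited_ev / PAIR_ENERGY_EV             # number of electron-hole pairs
    return pairs * ELECTRON_CHARGE_PC


if __name__ == "__main__":
    for let in (5.0, 40.0, 97.0):
        q_pc = collected_charge_pc(let, thickness_um=1.0)
        print(f"LET = {let:5.1f} MeV-cm2/mg -> Qc = {q_pc * 1e3:7.1f} fC")
```

For a 1 µm path this gives roughly 52 fC at an LET of 5, 413 fC at an LET of 40, and about 1 pC at an LET of 97, which matches the constant of equation (2) and the 1 pC/µm figure quoted from [23].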

At the electrical SPICE level, the charge deposition mechanism can be modeled

by a double exponential current pulse at the particle strike site, as presented by [45]:

$$I_P(t) = I_0 \left(e^{-t/\tau_\alpha} - e^{-t/\tau_\beta}\right) \qquad (3)$$

where I0 is approximately the maximum charge collection current, τα is the

collection time constant of the junction and τβ is the time constant for initially

establishing the ion track. In the circuit simulations and modeling, τβ is assumed to be

much smaller than τα, while τα is used as a variable parameter, as shown by [75] and by

[22]. SPICE transient analysis is performed by injecting a double exponential current pulse

as given by (3), with I0 and τα used as the variable parameters to

determine the minimum charge Qc corresponding to a given τα. The double exponential

model given by (3) has been shown to be adequate for studying the soft error mechanism at the

circuit simulation level [45].
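The double exponential pulse of equation (3) can be explored numerically before setting up a SPICE run. The sketch below is a minimal illustration, assuming hypothetical time constants (τα = 200 ps, τβ = 50 ps) and a hypothetical target charge of 100 fC; none of these values come from the text. It relies on the closed-form integral of (3) from zero to infinity, which is I0(τα − τβ), to pick I0 for a desired injected charge.

```python
import math


def double_exp_current(t: float, i0: float, tau_a: float, tau_b: float) -> float:
    """Equation (3): I0 * (exp(-t/tau_a) - exp(-t/tau_b)) for t >= 0, else 0."""
    if t < 0.0:
        return 0.0
    return i0 * (math.exp(-t / tau_a) - math.exp(-t / tau_b))


def total_charge(i0: float, tau_a: float, tau_b: float) -> float:
    """Closed-form integral of equation (3) over t in [0, infinity)."""
    return i0 * (tau_a - tau_b)


def i0_for_charge(q: float, tau_a: float, tau_b: float) -> float:
    """I0 that makes the pulse inject a total charge q."""
    return q / (tau_a - tau_b)


def peak_time(tau_a: float, tau_b: float) -> float:
    """Time of maximum current, obtained by setting dI/dt = 0."""
    return (tau_a * tau_b / (tau_a - tau_b)) * math.log(tau_a / tau_b)


if __name__ == "__main__":
    tau_a, tau_b = 200e-12, 50e-12          # hypothetical time constants (s)
    i0 = i0_for_charge(100e-15, tau_a, tau_b)  # hypothetical 100 fC strike
    tp = peak_time(tau_a, tau_b)
    print(f"I0 = {i0 * 1e3:.3f} mA, peak at {tp * 1e12:.1f} ps, "
          f"peak current = {double_exp_current(tp, i0, tau_a, tau_b) * 1e3:.3f} mA")
```

Note that the peak current only approaches I0 when τβ is much smaller than τα, which is the assumption stated in the text; with comparable time constants the peak is noticeably lower than I0.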

Depending on the fabrication details and the electrical characteristics of each

sensitive node (capacitance and resistance), different shapes of current transients can be

observed as shown by [23, 25]. Figure 1-3 illustrates a double exponential current with a

corresponding amount of charge Qi. The width of the induced transient voltage pulse

depends on the energy of the incident particle, the charge stored at the affected node

and the charge collection efficiency of the affected junction. So, according to the

electrical characteristics of the struck node, such as resistance and capacitance, transient

voltage pulses of different amplitudes and durations are generated.

Analytical models such as the ones proposed by [85] represent the generated

voltage pulse in each sensitive node according to parameters such as I0, τα, τβ, node

capacitance and resistance. Usually the duration of the transient voltage pulse in

nanometer technologies ranges from a few hundred picoseconds to a few nanoseconds.

As discussed by [42], in designs working at GHz frequencies, some transient voltage

pulses may last for a few clock periods.

Figure 1-3. The effect of a transient current pulse modeled as a double exponential current with a certain amount of charge in two different circuit nodes (plot of current versus time; labels: charge Qi, QDrift, Qdiffusion).

Once the values of I0, τα, and τβ are determined for a given technology and

particles of interest, any circuit designed in that technology may be evaluated at the

circuit level by modeling the charge deposition mechanism by (3). The values of I0, τα,

and τβ for a given technology may be obtained by device simulation as well as from

closed form expressions, as presented by [75], [22] and [26]. The minimum

amount of charge able to create a transient voltage pulse at a given node is

known as the critical charge. Very often it is important to obtain the critical charge of a

circuit in order to define the environment against which the circuit is hardened. In the next

section, the effects of energetic particle ionization are explained in combinational and

sequential circuits.

1.2 Single Event Effects in Digital ICs

A single particle can hit either the combinational logic or the sequential logic in

the silicon generating a soft error, as discussed by [19] and [1]. Figure 1-4 illustrates a

typical circuit topology found in nearly all sequential circuits. The data from the first

latch is typically released to the combinatorial logic on a falling or rising clock edge, at

which time logic operations are performed. The output of the combinatorial logic reaches

the second latch sometime before the next falling or rising clock edge. At this clock edge,

whatever data happens to be present at its input (and meeting the setup and hold times) is

stored within the latch.

Figure 1-4. The occurrence of transient faults in combinational and sequential logic (a combinational logic block placed between two sequential logic blocks)

When a charged particle strikes one of the sensitive nodes of a memory cell, such

as a drain in an off state transistor, it generates a transient current pulse that can turn on

the gate of the opposite transistor. The effect can produce an inversion in the stored

value, in other words, a bit flip in the memory cell. Memory cells have two stable states,

one that represents a stored ‘0’ and one that represents a stored ‘1.’ In each state, two

transistors are turned on and two are turned off (sensitive nodes). A bit-flip in the

memory element occurs when an energetic particle causes the state of the transistors in

the circuit to reverse, as discussed by [8] and [39]. This effect is called Single Event

Upset (SEU) and it is one of the major concerns in digital circuits because usually

memory cells are designed with very compact transistors that present high soft error

sensitivity (low critical charge). The SEU phenomenon is illustrated in figure 1-5.

Figure 1-5. Single Event Upset (SEU) in a static memory cell: (a) static memory cell with a particle strike; (b) the induced transient pulse flips the original stored value.

When a charged particle hits the combinational logic block, it also generates a

transient current pulse. This phenomenon is called single event transient (SET), as

presented in [39]. If the logic propagates the induced transient pulse, then the SET will

eventually appear at the input of a latch, where it may be interpreted as a valid signal.

Whether or not the SET gets stored as real data depends on the temporal relationship

between its arrival time and the falling or rising edge of the clock.

The transient pulse generated by the charge deposition mechanism might not be

captured by a memory cell because it could be logically, electrically or latching-window

masked, as discussed by [74] and [51]. Logical masking occurs when the input stimuli

hold controlling values on the logic path in such a way that the SET cannot be

propagated to the outputs. Figure 1-6(a) exemplifies this logical masking. Note that the

output holds the value one, independently of the SET value, because the NAND gate has one

of its inputs at logical zero and the NOR gate in the SET path consequently has

one of its inputs at logical one. Electrical masking occurs if the pulse is attenuated as it

propagates through the logic chain and fades out before it reaches the registered output,

as shown in figure 1-6(b). If a SET is neither logically nor electrically masked, it is

interpreted as a valid signal at the register input and it can be captured by the memory

element if it arrives within the latching window (usually based on the setup time and hold time

of the memory element), as shown in figure 1-6(c). Once a SET is captured, a wrong value will be

stored in the register, provoking a soft error.

Figure 1-6. Single Event Transient (SET) in a combinational circuit: (a) logical masking; (b) electrical masking; (c) latch-window masking.
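Two of the three masking effects described above can be illustrated with a few lines of Python. The sketch below is only an illustration: the small gate network is a hypothetical circuit built to mirror the NAND/NOR situation described in the text, not the actual netlist of figure 1-6, and the latching-window check uses the simplifying assumption that a SET can be captured whenever it overlaps the interval from (clock edge minus setup time) to (clock edge plus hold time). Electrical masking is not modeled.

```python
def nand(a: int, b: int) -> int:
    return 1 - (a & b)


def nor(a: int, b: int) -> int:
    return 1 - (a | b)


def inv(a: int) -> int:
    return 1 - a


def hypothetical_output(e0: int, e1: int, struck: int) -> int:
    """Hypothetical path: a NAND feeds a NOR that also sees the struck node;
    an inverter drives the register input."""
    return inv(nor(nand(e0, e1), struck))


def logically_masked(e0: int, e1: int, struck: int) -> bool:
    """True if flipping the struck node leaves the output unchanged."""
    nominal = hypothetical_output(e0, e1, struck)
    upset = hypothetical_output(e0, e1, 1 - struck)
    return nominal == upset


def overlaps_latching_window(pulse_start: float, pulse_width: float,
                             clock_edge: float, t_setup: float,
                             t_hold: float) -> bool:
    """Simplified latching-window test: does the SET overlap
    [clock_edge - t_setup, clock_edge + t_hold]? All times in the same unit."""
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    pulse_end = pulse_start + pulse_width
    return pulse_start < window_end and pulse_end > window_start


if __name__ == "__main__":
    # NAND input at 0 forces the NOR input to 1, so the SET is logically masked.
    print("e0=0:", logically_masked(0, 1, 0))
    # With both NAND inputs at 1 the NOR passes the struck node to the output.
    print("e0=1:", logically_masked(1, 1, 0))
    # Times in ns: a 0.3 ns SET arriving just before the clock edge at 1.0 ns.
    print("captured:", overlaps_latching_window(0.9, 0.3, 1.0, 0.05, 0.05))
```

In this toy circuit, a zero on either NAND input holds the output at one regardless of the struck node, which is the logical-masking condition; a SET that ends well before the setup window opens is masked by the latching window.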

As a result, the rate at which SETs get latched as errors depends on the operating

frequencies and the logic structure of the circuit. Further, since the inherent delay of

MOS transistors is decreasing with rapid technology scaling, the frequencies at which

circuits are operated are continuously increasing. This increases the probability of SETs

getting latched as errors. In addition, as the process technology shrinks and supply

voltage decreases, the charge stored at logic circuit nodes reduces roughly according to

Qnode = Cnode × Vdd, which is the main reason for the increased sensitivity of nodes to

radiation-induced upsets, as Qc can be larger than Qnode more often. Additional reasons

are the reduction in electrical and timing masking. The impact of the electrical masking

decreases with the technology scaling. This is due to shorter gate delays and reduced

logic depth between pipeline registers. The reduction in timing masking is a consequence

of higher operating frequencies, which increase the probability of a SET pulse being

latched. Thus, in Very Deep Sub-Micron (VDSM) technologies soft errors in logic

circuits are becoming a significant reliability problem.
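The scaling argument based on Qnode = Cnode × Vdd can be made concrete with a small calculation. The following sketch uses made-up node capacitances and supply voltages purely for illustration (they are not taken from the text or from any specific technology) and reuses the constant of equation (2) with an assumed 1 µm collection depth to estimate the LET at which the collected charge would equal the stored node charge.

```python
EQ2_PC_PER_LET_UM = 1.03e-2  # constant of equation (2): pC per (MeV-cm2/mg * um)


def node_charge_fc(c_node_ff: float, vdd_v: float) -> float:
    """Qnode = Cnode * Vdd; femtofarads times volts gives femtocoulombs."""
    return c_node_ff * vdd_v


def let_matching_node_charge(q_node_fc: float, thickness_um: float = 1.0) -> float:
    """LET (MeV-cm2/mg) at which equation (2) predicts Qc equal to Qnode."""
    q_node_pc = q_node_fc * 1e-3
    return q_node_pc / (EQ2_PC_PER_LET_UM * thickness_um)


if __name__ == "__main__":
    # Hypothetical nodes, chosen only to show the trend with scaling.
    examples = [
        ("larger, older node", 10.0, 3.3),   # (label, Cnode in fF, Vdd in V)
        ("scaled node", 1.0, 1.0),
    ]
    for label, c_ff, vdd in examples:
        q = node_charge_fc(c_ff, vdd)
        let = let_matching_node_charge(q)
        print(f"{label}: Qnode = {q:.1f} fC, matched at LET of about {let:.2f} MeV-cm2/mg")
```

Under these assumptions, the scaled node stores far less charge, so particles with a much lower LET can already deposit a comparable amount, which is the trend the paragraph describes.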

In [60], [88], [56], [66], the probability of a SET becoming a SEU is discussed.

The analysis of SET is very complex in large circuits composed of many paths.

Techniques such as timing analysis presented by [4], [88], [51], [55], and [20], can be

applied to analyze the probability of a SET in the combinational logic being stored by a

memory cell or resulting in an error in the design operation. Other techniques based on

formal binary decision diagrams are also proposed in [87].

Multiple bit upsets (MBU) are also becoming a concern because of the process

technology shrinking. MBU can appear due to SETs in nodes with fan-out higher than

one as shown in figure 1-7; or from double node ionizations due to angle of incidence of

the particle, as shown in figure 1-8, which is more common in highly dense memory

arrays.

Figure 1-7. Multiple Bit Upset due to a single SET (signals: inputs a0 to a5, nodes y0 and y1, register outputs Q0 and Q1)

Figure 1-8. Multiple Bit Upset due to an incident angle of the particle

In summary, it is mandatory to investigate techniques able to tolerate SETs and

SEUs in integrated circuits. In the next sections, a set of fault tolerant techniques for

integrated circuits is discussed. The limitations of each technique are addressed. There is

always a trade-off between the reliability of a technique and its area and

performance impact. In addition, according to the target design and application, fault

tolerant techniques can be applied at many different steps of the design flow, as will be

presented.

2. Radiation Hardening by Design: Strategies for ASICs

Modifications in the fabrication process technology can reduce the amount of

collected charge, but the reduction is not sufficient to avoid the SET occurrence. The

results published by [21] indicate that significant transients can be generated in both bulk

and SOI technologies at fairly low LETs. In bulk technologies these transients can be

quite large for technologies of 100nm and below, with durations of nearly 1 ns at LET

above 50 MeV-cm²/mg. Consequently, soft error mitigation techniques must still be

applied at different levels of the circuit design flow to ensure reliability.

Figure 2-1 represents the sequence of events that may occur once an energetic

particle hits the substrate, provoking ionization, as discussed previously. The

ionization track generates a set of electron-hole pairs that creates a transient current that

is injected or extracted at that node. According to the amplitude and duration of this

current pulse, a transient voltage pulse may appear at the hit node. This is characterized

as the FAULT. There is a FAULT LATENCY period that defines the time needed for that

fault to become an ERROR in the circuit. This will only occur if this transient voltage

pulse changes the logic state of a storage element (flip-flop), generating a bit-flip. This bit-flip

may generate an error if the content of this flip-flop is used for a certain operation. But

from the application point of view, it is not certain that this error will manifest as a FAILURE

in the system. There is also an ERROR LATENCY that defines the time needed for that

error to become a failure in the system. For each phase a different fault tolerant

technique can be used. Modern circuits may need fault-tolerance at many different levels

to ensure reliability.

For example, at the ionization and transient current generation phase, sensors can

be built in the silicon substrate to detect ionization currents. The idea at this point is to

notify the system that ionization has occurred. Once a transient voltage pulse is

generated, temporal filtering can be applied to detect the transient pulse in time.

However, the limitations of temporal filtering will be presented later on in this

manuscript. To mitigate the bit-flips, hardware redundancy and error correcting codes can

be used to correct the data. To correct an error, it is possible to use self-checking blocks

with recovery mechanisms or recomputation to restore the correct data. Finally, spare

chips may be used to guarantee operation of the system if a failure occurs.

Figure 2-1. Sequence of events from ionization to failure and a set of fault tolerant

techniques applied at different times.

A set of techniques able to tolerate this entire sequence of events is analyzed in

this manuscript:

Layout and Electrical level based techniques:
o Built-in sensors for ionization detection
o Transistor resizing for charge dissipation

Logic-level based techniques:
o Hardware redundancy for majority voting
o Time redundancy for temporal filtering
o Error correcting codes for detection and correction of bit-flips in memory elements
o Hardened memory cell for bit-flip avoidance

Architectural level based techniques:
o Recomputation

It is important to point out that there is always some penalty to be paid when

protecting circuits against upsets. Each technique may present a combination of area

overhead, performance penalty and power dissipation increase. The challenge is to select

the most suitable techniques for the target circuit application in order to meet the area,

time and power constraints, as well as the soft error hardness needed. In the next sections,

a set of techniques is presented, continuing the discussion in [38].

2.1 Layout- and Electrical-level based techniques

2.1.1 Bulk Built-in Current Sensors

Built-in current sensors (BICS) have been used for permanent fault detection, where the

permanent faults typically originate from imperfections in the integrated circuit

fabrication process, as presented in [5]. It is well-known that stuck-at faults can change

the amount of current consumed by a circuit, so BICS connected to the power lines can

detect current variations and consequently relay the occurrence of permanent faults.

However, soft errors, which are one of the major concerns nowadays, have a transient

effect and consequently, they do not present current variations at the power lines that can

be distinguished from any other circuit activity. The source of the effect is a transient

ionization that can only be seen at the hit node or at the bulk region. For this reason,

BICS connected to the power lines cannot help with soft error detection as is, but BICS

connected to the bulk region can sense the ionization.

As discussed above, during normal circuit operation the current flowing between

a reverse biased drain junction and bulk is negligible, if compared to the current peak

induced by an energetic particle hit. Consequently, it is cost-effective to connect a

BICS to the bulk of a circuit, instead of connecting it to the power lines of the

circuit. The bulk-BICS works as a monitor that senses the current at the bulk terminal.

During normal operation, the current in the bulk is approximately zero. Only the leakage

current flows through the reverse-biased junction, and it is still very low compared to the current

generated by charged particles. So, when a charged particle generates a current in the

bulk, it is very clear to the bulk-BICS that a SET has happened.

Figure 2-2 (a) shows the bulk-BICS connected to an integrated circuit as proposed

by [29]. For the bulk-BICS approach, it is necessary to have a dedicated BICS in each

type of well (N-well, P-well), consequently one BICS design is used for PMOS

transistors in the N-well and another BICS design is used for NMOS transistors in the P-

well. In addition, the possibility of distinguishing upsets that occur in the PMOS region

(BICS-P output) from the ones in the NMOS region (BICS-N output) can help to

precisely map the faulty region in the circuit design. Each bulk-BICS can detect

ionizations in a certain number of transistors, where this number is determined by the

designer considering the SET-detection sensitivity. For a circuit with n

transistors, i bulk-BICS are needed, where each bulk-BICS is connected to

n/i transistors. Figure 2-2 (b) depicts the connection of the bulk-BICS to the body ties.

The circuit itself is connected to the power lines (at the transistor sources), while the body

ties are connected to VDD or ground through the bulk-BICS.

Figure 2-2. The N-BICS and P-BICS connected to the bulk of an integrated circuit, as presented in [29]: (a) schematic of the bulk-BICS; (b) bulk-BICS sensors placed at the silicon substrate, where the body-tie is connected to VDD or ground through the bulk-BICS.

In the case of N-well, the body-ties are connected to VDD through the bulk-BICS,

while in the case of P-well, the body-ties are connected to ground through the bulk-BICS.

The bulk-BICS can be calibrated to detect ionizations that can produce

transient current pulses at the struck node. A SET is assumed to occur if the voltage of

the logic gate output node changes by more than VDD/2. This bulk-BICS technique

presents a conservative approach for SET detection.

Figure 2-3 shows the timing diagram of a SET detected by the bulk-BICS. When a SET occurs at any moment in a clock cycle, the bulk-BICS detects it after a certain delay, called the SET detection time, which depends on the area (number of transistors) protected by that bulk-BICS and on the size of the SET. The more intense the SET (large I0 and τα), the faster the bulk-BICS detects it. The more transistors connected to one BICS, the larger the capacitance associated with that connection and, consequently, the longer the SET detection time.

Once a SET is detected, the output of the bulk-BICS is raised, which notifies the control logic in the circuit to apply a fault-tolerance technique to handle the detected SET and to reset the bulk-BICS.

[Figure 2-3 timing diagram. Signals: clk, SET (Vdd/2 threshold), bulk-BICS, bulk-BICS_ctrl, reset_BICS; events 1, 2 and 3 and the detection delay are marked]

Figure 2-3. Bulk-BICS time diagram for detection and reset

2.1.2 Transistor Resizing for Charge Dissipation

Digital circuits have different resistance and capacitance values at each gate node, according to the node's fan-out and gate logic type. Consequently, each node presents a distinct critical charge (Qcrit), which is the minimum collected charge needed to provoke a SET or SEU at that node. When a soft error analysis tool is used (such as the ones

referred to previously), the probability of SET occurrence in a certain design is evaluated. It is then possible to identify the most sensitive nodes, which are the ones that present a higher chance of propagating a SET to the outputs (low logic and electrical masking) and a low Qcrit.

The idea of transistor resizing is to enlarge the width of some transistors in order to increase the capacitance of the most sensitive nodes, so that the critical charge of those nodes is increased. It is not desirable to increase the capacitance of every node, because this would make the circuit slow and power hungry. So, it is important to analyze the circuit sensitivity to soft errors in order to choose the nodes that are going to be modified.

Some recent works, such as the ones published by [89], [17], and [20], showed the variation of Qcrit as a function of the transistor channel widths and presented results on the reduction of SET sensitivity obtained by transistor resizing. Transistor resizing can also be replaced by gate duplication, as proposed by [56]. Figure 2-4 presents an example of a circuit with its three most SET-sensitive nodes. By eliminating the chance of a SET occurrence in these nodes, the sensitivity to SET of the entire circuit is reduced by 50%. In order to determine the transistor sizes (node capacitance and resistance) able to mitigate a certain range of energetic particles, the model equations applied for the calculation of the node critical charge and for SET generation [85] can be used. The challenges of this technique are: (a) meeting the circuit timing requirements when increasing the transistor sizes at the most sensitive nodes and (b) finding a transistor size with a critical charge that is able to avoid SETs over a range of LET. This method is suitable for low LET, such as alpha particle LET, which is around 2 MeV-cm²/mg, and neutrons up to 2 MeV-cm²/mg.

[Figure 2-4 schematic. Signal labels: A, B, C, D, E, F, Z; the most sensitive nodes are marked]

Figure 2-4. Transistor Resizing

2.2 Logic-level based techniques

The logic-level based techniques are all fault-tolerant techniques that can be easily applied at the gate level to tolerate soft errors (SET in combinational circuits and SEU in sequential circuits). They can be applied in hardware description languages such as VHDL and Verilog or at the schematic description level. The techniques presented are based on hardware redundancy, time redundancy, hardened memory cells, and error correction codes for information redundancy. As will be discussed in the next sections, some of these techniques are able to mitigate both SET and SEU, others only SET, and others only SEU.

2.2.1 Hardware redundancy techniques

Redundancy has always been successfully used to detect and vote out errors in the logic. The first basic approach is duplication with comparison (DWC), where the module is replicated and the outputs are compared. If the outputs mismatch, an error is detected. Of course, some errors can be masked by the application, so an error is only detected when it manifests as a wrong output value.

This scheme can be used for both combinational and sequential logic, for SET and SEU detection respectively, as presented in figure 2-5. It can also be applied to the entire circuit. It is common to have two processors executing the same task to detect errors in one of the two chips.
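As a rough behavioral illustration of duplication with comparison, the sketch below runs two copies of a module and compares their outputs. It is only a software analogy under stated assumptions: the module (`increment`), the fault model (`faulty_increment`, one flipped output bit) and the function name `duplication_with_comparison` are hypothetical and do not come from the course text.

```python
from typing import Callable, Tuple


def duplication_with_comparison(
    module_a: Callable[[int], int],
    module_b: Callable[[int], int],
    data: int,
) -> Tuple[int, bool]:
    """Run two copies of a module and flag a mismatch between their outputs."""
    out_a = module_a(data)
    out_b = module_b(data)
    error_detected = out_a != out_b  # the comparator
    return out_a, error_detected


def increment(x: int) -> int:
    """Hypothetical fault-free module: 8-bit increment."""
    return (x + 1) & 0xFF


def faulty_increment(x: int) -> int:
    """Same module with output bit 3 flipped to emulate an upset (assumed fault model)."""
    return increment(x) ^ 0b0000_1000


_, clean_flag = duplication_with_comparison(increment, increment, 41)
_, upset_flag = duplication_with_comparison(increment, faulty_increment, 41)
print(clean_flag, upset_flag)
```

Note that the comparator only reports that the two copies disagree; it gives no indication of which copy is wrong.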

However, the comparator is the key circuit because it is expected to detect the error and to be immune to errors as well. Comparators can usually be designed with larger transistors in order to be less sensitive to upsets, and they can also be duplicated.

Figure 2-5. Duplication with comparison scheme

However, duplication with comparison can only notify the circuit that an error is present; it cannot indicate which module or piece of logic has the error. A self-checking circuit, for example parity checking in arithmetic logic functions, can be used to detect an error. In this case, a hot backup approach can be used, as illustrated in figure 2-6(a). There is the main module (module 0) and the spare module (module 1). By default, the output receives the module 0 output. But if an error is detected in this module by the self-checking block, then the output receives the module 1 output, which is supposed to be fault free. On the other hand, the self-checking block can be very difficult to design, and very often the checker has the same complexity as the block that it must check. So, the duplication with backup approach is also very commonly used, as shown in figure 2-6(b). Module 0 and module 1 work in tandem and their outputs are continuously compared. If an upset is detected, then the output receives the output of the spare module (module 2), which is supposed to be fault free. The only problem is how to ensure that the spare module is fault free. To overcome this issue, modular redundancy with majority voters (MAJ voters) can be used.

[Figure 2-6 block diagrams: (a) Hot backup approach, with Module 0, Module 1 (spare) and a self-checking block between in and out; (b) Duplication with backup approach, with Module 0, Module 1 and Module 2 (spare) between in and out]

Figure 2-6. Hot backup and Duplication with backup approaches

In order to be able to detect an error and vote the correct output, n redundant elements are necessary, where n is typically an odd number equal to or larger than 3. This approach is called N-modular redundancy (N-MR). Triple modular redundancy (TMR) is the most common approach. It requires three modules working in tandem and a majority voter (MAJ voter) to vote the correct output. When an upset occurs, it is expected that at least two out of three outputs are correct, so the voter can decide the correct output. Figure 2-7 illustrates this approach used for sequential elements (flip-flops).
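The voting step can be sketched with the usual 2-out-of-3 majority function, MAJ(a, b, c) = ab + bc + ac, applied bitwise to three register copies. This is a minimal behavioral sketch; the register value and the position of the flipped bit are arbitrary assumptions chosen for illustration.

```python
def maj(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority: (a AND b) OR (b AND c) OR (a AND c)."""
    return (a & b) | (b & c) | (a & c)


stored = 0b1011_0010             # value held by the three redundant flip-flop banks
upset_copy = stored ^ 0b0000_0100  # one copy suffers a bit-flip (assumed SEU)

voted = maj(stored, upset_copy, stored)
print(f"{voted:08b}")
```

With a single corrupted copy the voter returns the stored value. If two copies are corrupted in the same bit, the majority function returns the corrupted bit, since it has no way to tell which side is right.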

Figure 2-7. Triple Modular Redundancy (TMR) in the sequential logic and the

majority voter (MAJ)

However, there are two main limitations in soft error protection when using only TMR in the sequential logic, as presented in figure 2-8 and figure 2-9. The first limitation is that a SET in the combinational logic can be stored in all three flip-flops at the same time, which makes the majority voter choose a wrong output (figure 2-8). The second limitation arises when the SET occurs in the majority voter. This SET can be propagated and

latched by the three flip-flops later in the circuit, as presented in figure 2-9. In both cases, the MAJ voter chooses the wrong output because 3 out of 3 values are wrong.

Figure 2-8. SET propagation in a TMR scheme in the sequential logic

Figure 2-9. SET propagation in the majority voter (MAJ)

In order to solve this problem, full TMR is proposed. In this case, the combinational logic and the voters are also triplicated, as shown in figure 2-10. If a SEU occurs in one of the flip-flops, the MAJ voter chooses the correct output, as shown in figure 2-10(a), and at the next clock edge the correct output can be loaded into the flip-flop, clearing the SEU. If a SET occurs in one of the combinational logic blocks, the SET may be captured by only one of the flip-flops, and the MAJ voter will be able to choose the correct output, as represented in figure 2-10(b). If a SET occurs in one of the MAJ voters, the voter output will show the transient for a short period of time, as shown in figure 2-10(c). But since the whole circuit is triplicated, only one redundant part is affected and the SET will be voted out at the next MAJ voter. Usually the three voter outputs can be connected outside the chip, as shown in figure 2-10(d). This scheme acts as a kind of analog voter: even if an upset occurs in one of the voters, the currents at the output will provide the correct output.

TMR presents two weaknesses: (a) it does not protect against double faults

simultaneously affecting different redundant modules, of which the probability of

occurrence has increased in the nanometer technologies as discussed by [70]; and (b) a

single fault in the last voter itself can generate undetected errors, as shown by [43]. Even when TMR is also applied to the voters, a fourth voter is always needed to choose the final output of the circuit. This last voter in the chain, even with a lower probability of producing an error due to a SET, will always be subject to this problem.

[Figure 2-10 panels: (a) SEU mitigation; (b) SET mitigation in the logic; (c) SET in the voters; (d) Output voter, with the three voter outputs joined outside the chip, at the board level, to form OUT]

Figure 2-10. Full Triple Modular Redundancy (TMR) with self-recovery

In [71], an analog voter has been proposed to ensure complete tolerance against

SET in TMR solutions. This voter, shown in figure 2-11, uses an analog comparator,

instead of the traditional digital sum of products, to decide the output value. The

robustness of the proposed analog majority voter relies on three main points: duplicated

input nodes, well-dimensioned transistors in the analog comparator and output transistors

always in the on state.

[Figure 2-11 block diagram. Labels: module 0, module 1, module 2, majority logic voter (MAJ voter), analog comparator (+/-) with a VDD/2 reference]

Figure 2-11. Full Triple Modular Redundancy (TMR) with Analog Majority Voter

Consequently, if a SET occurs at one of the inverter transistors, there is always

another transistor, in parallel, to ensure the correct value. The same happens if a SET

occurs inside the analog comparator logic; the transistors are set in a way that the SET

will not be able to turn on or off other transistors, holding the correct value at the output.

Finally, if a SET occurs at the output node, that transistor is already conducting a current

and the additional current generated by the transient pulse does not change the output

state and, therefore, does not harm the operation of the circuit. A more detailed analysis of the behavior of an analog comparator under SET can be found in [48].

The schematic diagram of the analog comparator used to implement the MAJ voter and the dimensions of the transistors, implemented in a 32 nm technology as proposed by [46], are shown in figure 2-12. The six inverters connecting the inputs to the comparator, shown in figure 2-11, were implemented using PMOS transistors with W = 144 nm and L = 32 nm, and NMOS transistors with W = 80 nm and L = 32 nm.

[Figure 2-12 schematic: analog comparator with inputs Vin and Vref, output Out, and transistors M1 to M7]

Figure 2-12. The schematic of the Analog Comparator used in the Analog Majority Voter and the W/L ratio of each transistor for the 32nm CMOS technology using

the predictive model from Berkeley.

The analog majority voter can also be used as the basic logic gate to implement Boolean functions. As presented in [46], any combinational logic can be mapped to a tree of analog majority voters, which makes each node of the circuit robust to SET. By using this technique, the circuit can tolerate multiple SETs. Figure 2-13 illustrates a one-bit full adder mapped to the analog MAJ voters.

[Figure 2-13 schematic: one-bit full adder built as a tree of MAJ voters. Labels: inputs A, B, Cin, complemented inputs !A and !Cin, constants 0 and 1, outputs sum and cout]

Figure 2-13. One-bit Full Adder implemented by only Analog Majority Voters
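To make the idea of majority-only logic concrete, the sketch below builds a one-bit full adder from three 3-input majority functions plus inverters, using a well-known decomposition: cout = MAJ(A, B, Cin) and sum = MAJ(!cout, MAJ(A, B, !Cin), Cin). This is not necessarily the mapping drawn in Figure 2-13, which also uses constant inputs, and a digital `maj` function says nothing about the analog SET robustness of the voter; it only checks the Boolean mapping.

```python
from itertools import product


def maj(a: int, b: int, c: int) -> int:
    """3-input majority gate on single bits."""
    return (a & b) | (b & c) | (a & c)


def inv(x: int) -> int:
    """Single-bit inverter."""
    return x ^ 1


def full_adder_maj(a: int, b: int, cin: int) -> tuple:
    """One-bit full adder using only majority gates and inverters (assumed mapping)."""
    cout = maj(a, b, cin)
    total = maj(inv(cout), maj(a, b, inv(cin)), cin)
    return total, cout


# Exhaustive check against ordinary binary addition.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder_maj(a, b, cin)
    assert 2 * cout + s == a + b + cin
```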

In summary, hardware redundancy techniques such as full TMR for

combinational and sequential logic and the solution of mapping Boolean logic functions

to analog majority voters can protect the circuit against SET and SEU. The drawback of

these techniques is the area overhead and consequently power dissipation.

2.2.2 Time redundancy techniques

Time redundancy techniques process the data at different times, which allows the detection of faults [47]. In the simplest scheme, two flip-flops controlled by a clock and a delayed clock latch the combinational output at two different times, which allows the detection of a SET, as shown in figure 2-14. The delay (d) must be chosen according to the duration of the SET that must be detected. Suppose that the longest SET that can occur in this circuit lasts 600 ps. Then the delay (d) must be at least 600 ps so that one flip-flop captures the correct data while the other one captures the SET. This scheme can only detect a SET; it cannot vote the correct output.

Figure 2-14. Time redundancy scheme for SET detection

Full time redundancy is when the output is captured at three different times, which allows a majority voter to choose the correct output. In this case, the output of the combinational logic is latched at three different moments, where the clock edge of the second latch is shifted by the time delay d and the clock of the third latch is shifted by the time delay 2d. A voter chooses the correct value, as shown in figure 2-15.
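The three-sample voting can be sketched behaviorally as follows. This is a simplified model, not a circuit: the SET is assumed to be a single pulse that inverts the combinational output over a half-open time interval, times are in picoseconds, and the names `sample_with_set` and `temporal_vote` are hypothetical.

```python
def maj(a: int, b: int, c: int) -> int:
    """3-input majority on single bits."""
    return (a & b) | (b & c) | (a & c)


def sample_with_set(correct: int, t: int, set_start: int, set_width: int) -> int:
    """Value seen at time t when a SET inverts the output during [set_start, set_start + set_width)."""
    in_pulse = set_start <= t < set_start + set_width
    return correct ^ 1 if in_pulse else correct


def temporal_vote(correct: int, clk_edge: int, d: int, set_start: int, set_width: int) -> int:
    """Sample at clk, clk+d and clk+2d, then take the majority of the three samples."""
    samples = [
        sample_with_set(correct, clk_edge + k * d, set_start, set_width)
        for k in range(3)
    ]
    return maj(*samples)


# d = 600 ps; the SET begins 100 ps before the first clock edge.
short_set = temporal_vote(correct=1, clk_edge=0, d=600, set_start=-100, set_width=600)
long_set = temporal_vote(correct=1, clk_edge=0, d=600, set_start=-100, set_width=800)
print(short_set, long_set)
```

In this model a pulse no wider than d can corrupt at most one of the three samples, so it is voted out, while a wider pulse can corrupt two samples and defeat the voter.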

Figure 2-15. Full time redundancy scheme

This technique is also able to vote out a SET that occurs in a MAJ voter, as presented in figure 2-16(a), provided the MAJ voter is not the very last one (figure 2-16(b)).

(a) SET in MAJ voters

(b) SET in the very last MAJ voter

Figure 2-16. SET propagation in the MAJ in time redundancy schemes

However, this time redundancy technique based on clock delay does not work for SET mitigation in nanometer technologies when the SET pulse duration is large compared to the clock period. Figure 2-17 shows the problem in a timing diagram. Let us consider a 90 nm technology working at 1 GHz. The maximum delay time between two registers separated by combinational logic is 1 ns, which is the period of the clock. For this technology, as discussed previously, SETs can vary from a few hundred picoseconds to nanoseconds. Suppose that SET pulses up to 600 ps must be tolerated. So, the

clock delay d must be at least 600 ps. Then, the first clock (clk) occurs at time t = 0, the second clock (clk+d) occurs at t = 600 ps and the third clock (clk+2d) occurs at t = 1,200 ps. After the data is stored in all three latches, the MAJ voter chooses the correct output, which adds its own propagation delay. Consequently, 1,200 ps plus the MAJ voter propagation delay are needed to vote out the SET, not counting the combinational logic propagation delay. The new achievable frequency is less than half of the original frequency. This shows that this method is suitable only for SETs with a duration not higher than 10% of the clock period. However, it is well known that for nanometer technologies, SETs are of the same order of magnitude as the clock period, as discussed by [21], [26] and [42]. So, new time redundancy techniques must be investigated.
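The arithmetic behind the frequency penalty can be reproduced with a few lines. The sketch assumes that the combinational logic uses the full original 1 ns period and that the MAJ voter adds 50 ps; the voter delay is a made-up value for illustration only, and `min_period_ps` is a hypothetical helper.

```python
def min_period_ps(comb_delay_ps: int, d_ps: int, maj_delay_ps: int) -> int:
    """Minimum clock period: logic delay, plus two sampling delays, plus the voter delay."""
    return comb_delay_ps + 2 * d_ps + maj_delay_ps


original_period_ps = 1000  # 1 GHz design from the text
d_ps = 600                 # longest SET to be tolerated
maj_delay_ps = 50          # assumed voter propagation delay (not from the text)

new_period_ps = min_period_ps(original_period_ps, d_ps, maj_delay_ps)
new_freq_ghz = 1000 / new_period_ps
print(new_period_ps, round(new_freq_ghz, 3))
```

Even with a zero voter delay, the period would grow from 1,000 ps to 2,200 ps, which is consistent with the statement that the achievable frequency drops below half of the original.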

[Figure 2-17 timing diagrams. Signals in each panel: clk, clk+d, clk+2d, comb (with SET), ffp0, ffp1, ffp2, MAJ; the delays d, the MAJ + comb delays and the period T are marked. (a) short duration SET; (b) long duration SET]

Figure 2-17. SET propagation in the MAJ in time redundancy schemes

2.2.3 Mixed Hardware and Time Redundancy Techniques

Mixed hardware and time redundancy techniques attempt to mitigate SET and SEU by using the best characteristics of hardware and time redundancy in order to achieve a lower area overhead and, at the same time, a lower performance penalty.

The code word state preserving (CWSP) technique proposed by [2, 3], illustrated in figure 2-18, is an example of mixed hardware and time redundancy. From the hardware redundancy point of view, it has redundant combinational logic and extra transistors in the very last gate stages. From the time redundancy point of view, the output is only transmitted to the flip-flop input when both combinational logic outputs agree. So, if a SET occurs in one of the combinational logic blocks, the flip-flop input has a high impedance value for as long as the SET lasts.

[Figure 2-18 panels: (a) General CWSP scheme (labels: a, a*, b, b*, two combinational logic blocks, CWSP, clk+delay); (b) Example of logic gates with extra transistors to block the SET propagation]

Figure 2-18. Code Word State Preserving technique for SET mitigation

This technique does not need voters or comparators but it has an asynchronous

behavior because it is not possible to determine when both combinational outputs are

ready to be stored in the flip-flop. Consequently, the clock period cannot be fixed. Also,

the very last transistors of the logic are sensitive to SET, so they must be sized in order to

be less sensitive than the others.

Figure 2-19 presents the timing diagram of this technique, showing its limitations for SETs with long pulse duration. Note that in figure 2-19(b) the flip-flop will store the high impedance value, as the two combinational output values do not yet agree with each other.

[Figure 2-19 timing diagrams. Signals in each panel: clk, a (with SET), a*, out (high impedance `Z` during the mismatch), over time t. (a) short duration SET; (b) long duration SET]

Figure 2-19. Time diagram of CWSP approach

The technique proposed by [50] uses hardware redundancy for the combinational logic and for the register circuit, with a C-element and a keeper circuit at the latch outputs, as shown in figure 2-20. If a SEU occurs in one of the latches, as illustrated in figure 2-20(a), the C-element does not propagate the upset and the keeper element is able to maintain the output value. Note that the C-element in this case inverts the values stored in the latches. This ensures the correct value at the output (OUT) in the presence of a SEU.

If an upset occurs in the combinational logic, then one latch registers the SET while the other one registers the correct value, as shown in figure 2-20(b). While Clock is equal to one, there are intervals in which the C-element propagates the correct value and an interval in which the C-element blocks the propagation because the latch values mismatch. This phenomenon is illustrated in figure 2-21(a). However, for a long SET pulse, the C-element may never propagate the correct value, as the latches may keep diverging for the entire period when clock is equal to one, as seen in figure 2-21(a) on the left. In this case, the output (OUT) may hold the value of the previous clock cycle, which compromises the synchronization of the circuit.

The redundant logic can be replaced by a delay (time filtering), as shown in figure 2-20(c). In this case, the two latches may store the SET at different times, as represented in figure 2-21(b). The problem occurs when a long SET pulse happens, because in this case the two latches can hold the SET value at the same time, making the C-element propagate the wrong value, which will be kept by the keeper circuit. In summary, this technique works well for SEU but cannot properly protect against SET.
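The behavior of the inverting C-element with a keeper can be sketched as a small state machine: when both latch values agree, the output is driven to their inverse; when they disagree, the keeper holds the last driven value. This is a behavioral sketch under those assumptions, not the transistor-level circuit of [50], and the class name `InvertingCElement` is hypothetical.

```python
class InvertingCElement:
    """Behavioral model of an inverting C-element followed by a keeper."""

    def __init__(self, initial_out: int) -> None:
        self.out = initial_out

    def update(self, latch_a: int, latch_b: int) -> int:
        if latch_a == latch_b:
            # Inputs agree: the C-element drives the inverted value.
            self.out = latch_a ^ 1
        # Inputs disagree: the output is not driven and the keeper holds the old value.
        return self.out


c = InvertingCElement(initial_out=1)
normal = c.update(1, 1)       # both latches hold 1, so OUT becomes 0
after_seu = c.update(0, 1)    # SEU flips one latch: the keeper holds OUT at 0
after_long_set = c.update(0, 0)  # both latches captured a long SET: wrong value propagates
print(normal, after_seu, after_long_set)
```

The last call shows the weakness discussed above: once both latches hold the erroneous value at the same time, the C-element sees agreement and propagates it.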

(a) SEU in the latches

(b) SET in the combinational logic when using logic duplication

(c) SET in the combinational logic when using time filtering

Figure 2-20. Hardware and Time redundancy with C-element [50]

This technique, as the ones presented previously, is inadequate for long duration

pulse SETs. When long SET pulses occur, the output holds the previous value or even the

wrong value, which can compromise the synchronization of the circuit. A solution for

mitigating long SET pulses may be based on recomputation combined with a low cost

technique able to detect the SET and SEU faults.

[Figure 2-21 timing diagrams. (a) Time diagram for SET in the combinational logic when using logic duplication, for short and long SET pulses; signals: clock, c_out0 (with SET), c_out1, OUT. For the short pulse, the C-element propagates the correct value except while the keeper holds the value; for the long pulse, the keeper holds the previous value. (b) Time diagram for SET in the combinational logic when using time filtering, for short and long SET pulses; signals: clock, c_out (with SET), delayed c_out, OUT. For the long pulse, the C-element propagates the wrong value.]

Figure 2-21. Time diagram from the technique presented in [50]

2.2.3 Hardened Memory Cells

Memory elements can be protected against SEU (bit-flip) by modifying their original design with extra resistors or transistors that are able to recover the stored value if a particle strikes one of the drains of a transistor in the “off” state. These cells are called hardened memory cells, and they can avoid the occurrence of a SEU by design, according to the particle charge and flux.

In order to better understand how these hardened memory cells work, let us start with the analysis of a standard static memory cell composed of 6 transistors. When a memory cell holds a value, it has two transistors in the “on” state and two transistors in the “off” state; consequently, there are always two SEU-sensitive nodes in the cell. When a particle strikes one of these nodes, the energy transferred by the particle can cause a transistor to switch “on”. This event will flip the value stored in the memory. If a resistor is inserted between the output of one of the inverters and the input of the other one, the signal can be delayed long enough to avoid the bit flip.

The SEU tolerant memory cell protected by resistors, proposed by [83] for ASICs and by [68] for FPGAs, was the first proposed solution to this problem (figure 2-22(a)). The decoupling resistor slows the regenerative feedback response of the cell, so the cell can discriminate between an upset caused by a voltage transient pulse and a real write signal. It provides a high silicon density; for example, the gate resistor can be built using two levels of polysilicon. The main drawbacks are temperature sensitivity, performance vulnerability at low temperatures, and an extra mask in the fabrication process for the gate resistor. However, a transistor controlled by the bulk can also implement the resistor, avoiding the extra mask in the fabrication process. In this case, the gate resistor layout has a small impact on the circuit density.

Memory cells can also be protected by an appropriate feedback designed to restore the data when it is corrupted by an ion hit. The main problems are the placement of the extra transistors in the feedback path in order to restore the upset value and the influence of the new sensitive nodes. Examples of this method are the IBM hardened memory cells proposed by [69] in figure 2-22(b), HIT cells in figure 2-22(c) proposed by [12, 79], DICE cells in figure 2-22(d) proposed by [14] and NASA memory cells proposed by [84, 44, 15], represented in figure 2-22(e). The main advantages of this method are independence from temperature, supply voltage and process technology, and good SEU immunity. The main drawback is the silicon area overhead due to the extra transistors and their extra size.

[Figure 2-22 schematics: (a) Resistor memory cell; (b) IBM hardened memory cell; (c) HIT memory cell; (d) DICE hardened memory cell; (e) NASA hardened memory cell]

Figure 2-22. Examples of SEU hardened cells

2.2.4 Error Correcting Code (ECC)

The error correcting code technique is based on information redundancy and is used to mitigate SEU in integrated circuits, as discussed by [18]. It is usually used in memory arrays, but it can also be applied to registers or other small memory structures in microprocessors, for instance. Designers can implement ECC detection and correction in hardware or in software; [73] compares the reliability of ECC implemented at these two levels. The simplest error correcting codes can correct single-bit errors and detect double-bit errors, while more complex ones can detect or correct multi-bit errors. Examples include Hamming code, BCH code, Reed-Solomon code, Reed-Muller code, binary Golay code, convolutional code, and others. Simple codes are usually implemented in hardware using extra memory bits and encoding/decoding circuitry.

Figure 2-23 illustrates 8-bit data being written to and read from a register. If a SEU occurs and there is no information redundancy, an error occurs but it is not detected, which can lead to catastrophic consequences in the circuit. If a parity bit is added to the stored data, it is possible to detect an error when the parity bits mismatch. For many applications, it is not enough to detect the error; it is also necessary to correct it. For those, it is possible to use an ECC code with encoder and decoder blocks. The encoder block creates a set of check bits that help to identify the error position, and the decoder block is then able to restore the correct value.

[Figure 2-23 panels: 8-bit data written to and read from a register in three cases. Without redundancy, the upset produces an undetected error. With a parity bit P, the parity does not match on read and the error is detected. With an encoder and a decoder, the error is corrected.]

Figure 2-23. Error Correcting Code Principle
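The parity-bit case can be sketched in a few lines. The example assumes even parity over an 8-bit word; the data value and the flipped bit positions are arbitrary choices for illustration and are not read from the figure.

```python
def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1s (data plus parity) even."""
    return bin(word).count("1") % 2


written = 0b1010_1000
stored_parity = even_parity_bit(written)  # computed when the word is written

# A single bit-flip (assumed SEU) changes the parity, so it is detected on read.
single_upset = written ^ 0b0100_0000
single_detected = even_parity_bit(single_upset) != stored_parity

# Two bit-flips cancel out in the parity, so they go undetected.
double_upset = written ^ 0b0110_0000
double_detected = even_parity_bit(double_upset) != stored_parity

print(stored_parity, single_detected, double_detected)
```

A single parity bit detects any odd number of flipped bits but cannot locate them, which is why correction needs the additional check bits of an ECC.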

An example of a widely used ECC is the Hamming code [31] in its simplest version. It is an error-detecting and error-correcting binary code that can detect all single- and double-bit errors and correct all single-bit errors (SEC-DED). This coding method is recommended for systems with low probabilities of multiple errors in a single data structure (e.g., only a single bit error in a byte of data). The code satisfies the relation 2^k ≥ m+k+1, where m+k is the total number of bits in the coded word, m is the number of information bits in the original word, and k is the number of check bits in the coded word. Following this relation, the Hamming code can correct all single-bit errors on n-bit words and detect double-bit errors when an overall parity check bit is used.

The Hamming code implementation is composed of a combinational block responsible for encoding the data (encoder block), extra storage bits in the word that hold the check bits (extra latches or flip-flops), and another combinational block responsible for decoding the data (decoder block). The encoder block calculates the check bits and can be implemented by a set of n-input XOR gates. The decoder block is more complex than the encoder block because it must not only detect the fault but also correct it. It is basically composed of the same logic used to compute the check bits plus a decoder that indicates the address of the bit that contains the upset. The decoder block can also be built from a set of n-input XOR gates and some AND and INVERTER gates.

The encoder block calculates the check bits that are placed in the coded word at positions 1, 2, 4, …, 2^(k-1). For example, for 8-bit data, 4 check bits (p1, p2, p3, p4) are necessary so that the Hamming code is able to detect and correct a single-bit error (SEC-SED). Figure 2-24 shows a 12-bit coded word (m=8 and k=4) with the check bits p1, p2, p3 and p4 located at positions 1, 2, 4 and 8, respectively. The check bits indicate the position of the error. The check bit p1 creates even parity for the bit group {1, 3, 5, 7, 9, 11}. The check bit p2 creates even parity for the bit group {2, 3, 6, 7, 10, 11}. Similarly, p3 creates even parity for the bit group {4, 5, 6, 7, 12}. Finally, the check bit p4 creates even parity for the bit group {8, 9, 10, 11, 12}.

Encoder block: check bits generation

P1 = W3 xor W5 xor W7 xor W9 xor W11
P2 = W3 xor W6 xor W7 xor W10 xor W11
P3 = W5 xor W6 xor W7 xor W12
P4 = W9 xor W10 xor W11 xor W12

Decoder block: syndromes

Syndrome P1 = P1 xor W3 xor W5 xor W7 xor W9 xor W11
Syndrome P2 = P2 xor W3 xor W6 xor W7 xor W10 xor W11
Syndrome P3 = P3 xor W5 xor W6 xor W7 xor W12
Syndrome P4 = P4 xor W9 xor W10 xor W11 xor W12

Decoder block: mask generation

Syndrome (P4P3P2P1)    Mask (P1 P2 W3 … W11 W12)
0000                   no error
0001                   100000000000
0010                   010000000000
0011                   001000000000
0100                   000100000000
0101                   000010000000
0110                   000001000000
0111                   000000100000
1000                   000000010000
1001                   000000001000
1010                   000000000100
1011                   000000000010
1100                   000000000001

Figure 2-24. Hamming code check bits generation for an 8-bit word, 12-bit coded word
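The encoder, syndrome and mask steps of the 12-bit coded word can be sketched in software as follows. This is a minimal illustration, not a hardware description: the function names (encode, syndrome, decode) and the list-based bit representation are assumptions made for the example. Bit positions 1 to 12 follow the numbering used in the text, with check bits at positions 1, 2, 4 and 8.

```python
# Check bit at position p covers every position whose binary index has bit p set,
# which reproduces the groups {1,3,5,7,9,11}, {2,3,6,7,10,11}, {4,5,6,7,12}, {8,...,12}.
CHECK_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)


def encode(data_bits):
    """Place 8 data bits at W3..W12 and compute the even-parity check bits."""
    assert len(data_bits) == 8
    word = [0] * 13  # index 0 is unused so that indices match bit positions
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for p in CHECK_POSITIONS:
        for pos in range(1, 13):
            if pos & p and pos != p:
                word[p] ^= word[pos]
    return word[1:]


def syndrome(coded):
    """Return the syndrome P4P3P2P1 as an integer (0 means no error detected)."""
    word = [0] + list(coded)
    result = 0
    for p in CHECK_POSITIONS:
        parity = 0
        for pos in range(1, 13):
            if pos & p:
                parity ^= word[pos]
        if parity:
            result |= p
    return result


def decode(coded):
    """Correct a single-bit error, if any, and return the 8 data bits."""
    word = [0] + list(coded)
    s = syndrome(coded)
    if 1 <= s <= 12:
        word[s] ^= 1  # the syndrome value is the position of the upset bit
    return [word[pos] for pos in DATA_POSITIONS]
```

Flipping any single bit of an encoded word and calling decode returns the original data, because the syndrome equals the position of the flipped bit, which is the role of the mask table above.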

Hamming code can protect structures such as registers, register files and memories, as presented in figure 2-25. Depending on the organization, one encoder and one decoder can be multiplexed in time to protect many registers and memory elements. Hamming code increases area by requiring additional storage cells (check bits), plus the encoder and the decoder blocks. For an n-bit word, approximately log2(2n) additional storage cells are needed. However, the encoder and decoder blocks may add a more significant area increase because of the extra XOR gates. Regarding performance, the delay of the encoder and decoder blocks is added to the critical path. The delay becomes more critical as the number of bits in the coded word increases, since the number of XOR gates in series grows with the number of bits in the coded word.
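As a quick check of the storage overhead, the number of check bits can be computed directly from the standard Hamming bound 2^k ≥ m + k + 1 and compared with the log2(2n) estimate. The helper name check_bits is an assumption made for this sketch.

```python
import math


def check_bits(m):
    """Smallest k satisfying 2**k >= m + k + 1 for m information bits."""
    k = 1
    while 2 ** k < m + k + 1:
        k += 1
    return k


# Compare the exact count with the log2(2n) approximation for common word sizes.
for m in (8, 16, 32, 64):
    print(m, check_bits(m), math.log2(2 * m))
```

For word sizes that are powers of two, the two values coincide: an 8-bit word needs 4 check bits and a 16-bit word needs 5.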

[Figure 2-25 block diagrams: encoder, word(s) with check bits, decoder, and refreshing logic that feeds the decoded data back; the memory version adds WR and RD signals.]

(a) Registers protected by ECC (b) Memory protected by ECC

Figure 2-25. ECC in memory elements with a feedback refreshing to clean up the SEUs

A microcontroller protected by Hamming code was proposed in [41]. The ECC was implemented in the datapath, memory arrays and control logic. Figure 2-26 shows the microcontroller datapath.

[Figure 2-26 block diagram: datapath with PC, add/sub PC, ALU, RAM memory, ROM memory (12-bit data), AD_low and AD_high address registers, encoder and decoder blocks, and data refreshing. All the registers are 12-bit (coded by Hamming code).]

Figure 2-26. Example of a microcontroller datapath protected by ECC

The limitation of Hamming code is that it cannot correct double-bit upsets, which can be very important for very deep sub-micron technologies, especially in memories, because of the high density of the cells [65]. Other codes must be investigated to cope with multiple-bit upsets, whose probability of occurrence is increasing with technology scaling, as shown in [13]. Reed-Solomon [31] is an error-correcting code that was devised to address the issue of correcting multiple errors. It has a wide range of applications in digital communications and storage. Reed-Solomon codes are used to correct errors in many systems, including storage devices, wireless or mobile communications, high-speed modems and others. Reed-Solomon (RS) encoding and decoding is commonly carried out in software, but an efficient hardware implementation of an RS code was presented in [53, 54] to protect memories against multiple SEUs.

When using ECC, it is appropriate to apply interleaving, which means that the bits of a word protected by the same check bits must not be placed physically next to each other. This technique helps ensure that an upset of two nearest-neighbor memory cells does not fall in the same check word, where it would appear as a multiple-bit upset to an ECC that corrects only single-bit errors. In [54], a memory interleaving organization was proposed in which two ECCs are used, Reed-Solomon and Hamming code, to ensure correction in the presence of massive multiple upsets (figure 2-27).

Figure 2-27. Example of interleaving in a memory protected by two ECCs [54]
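The effect of interleaving can be illustrated with a simple column-mapping sketch. The mapping below (bit i of word w stored at column i × interleave + w) is one common arrangement and is an assumption for this example; it is not necessarily the organization used in [54]. The function names are hypothetical.

```python
def physical_column(word_index, bit_index, interleave):
    """Column of a bit when `interleave` words share one physical row."""
    return bit_index * interleave + word_index


def words_hit(columns, interleave):
    """Map upset physical columns back to (word index, bit index) pairs."""
    return [(col % interleave, col // interleave) for col in columns]


# Two physically adjacent cells (columns 8 and 9) are upset by one particle.
print(words_hit([8, 9], interleave=4))  # the two upsets land in different words
print(words_hit([8, 9], interleave=1))  # without interleaving, both hit word 0
```

With an interleaving factor of 4, each affected word sees only a single-bit error, which a single-error-correcting code can repair.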

2.3 Architectural level based techniques

Whenever the effect of a fault is detected at the architectural level, the circuit is already computing with an error, and in this case a computational recovery is mandatory. Current microprocessors already maintain checkpoints across tens of instructions for the purpose of speculation recovery, as discussed in [82]. This makes fault recovery well suited to today's microprocessors.

There are two general principles for recovery: forward and backward. Forward error recovery means detecting an error and continuing on in time while attempting to mitigate the effects of the faults that may have caused the error. It implies the constructive use of redundancy. For example, temporally or spatially replicated messages may be averaged or compared to compensate for lost or corrupted data.

Backward error recovery means detecting an error and returning to an earlier system state or time. It includes the use of error detection (by redundancy, comparing pairs or error-detecting codes) so that evasive action can be taken. It subsumes rollback to an earlier version of data or to an earlier system state. Backward error recovery may also include fail-safe, fail-stop and graceful degradation modes that may yield a safe state, degraded performance or a less complete functional behavior. There can be problems if backward recovery is used in real-time systems. One problem is that interactions with the environment cannot be undone. Another problem is how to meet the timing requirements.

In summary, both forward and backward recovery have advantages and drawbacks. Forward recovery can meet timing requirements, but it needs inherent redundancy for both error detection and redundant computation. Backward recovery needs only error detection, not redundancy of the entire computation, because once an error is detected the computation is performed again by the same hardware; however, it is sometimes hard to meet timing requirements with this approach.
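Backward recovery can be sketched as a checkpoint-and-retry loop. The function below is a software illustration under stated assumptions: the state is a dictionary, `steps` are callables that modify it in place, and `check` is whatever error detector is available (comparison of duplicated results, an error-detecting code, and so on). All names are hypothetical.

```python
import copy


def run_with_rollback(state, steps, check, max_retries=3):
    """Execute `steps` in order; on a detected error, restore the last
    checkpoint and re-execute the failing step (backward recovery)."""
    for step in steps:
        checkpoint = copy.deepcopy(state)  # last state known to be good
        for _attempt in range(max_retries + 1):
            step(state)
            if check(state):
                break  # result accepted, move on to the next step
            # Error detected: roll back to the checkpoint and retry.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
        else:
            raise RuntimeError("error persisted after retries")
    return state
```

The retry costs at least one extra execution of the step, which is the timing penalty that makes backward recovery difficult in real-time systems.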

So the challenge is to detect the fault effect efficiently. One option is fingerprinting, presented in [6]: the outputs of a system are processed and checked in parallel for only a subset of its possible inputs. This concept can be applied to the general case of a circuit that must be hardened against soft errors, providing tolerance against transient faults caused by pulses that affect parts of the circuit, even when the duration of the transient pulse is longer than the delay of several gates. Concurrent error detection techniques for combinational logic blocks are also presented in [49]. Figure 2-28 illustrates the general concept of using infrastructure IP to check logic-block integrity. Checking a logic function requires a predictor block to compute the input signature and a checker to compare the output and input signatures. The challenge is in designing and implementing the most efficient blocks while preserving performance and keeping cost down.

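[Figure 2-28 block diagram: the inputs feed Function f, which produces the output, and a Prediction block; an Output Characteristic block and the Prediction block feed a checker that raises the error signal.]

Figure 2-28. Fingerprinting Technique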
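One concrete instance of the predictor/checker structure is residue checking of an adder. This is a well-known concurrent error detection scheme chosen here for illustration; it is an assumption of this sketch and not necessarily the scheme used in [6] or [49]. The output characteristic is the result modulo 3, and the predictor derives the same value from the inputs alone.

```python
MODULUS = 3  # assumption: a mod-3 residue serves as the characteristic


def adder(a, b):
    """Function f under protection: an 8-bit adder with a 9-bit result."""
    return (a + b) & 0x1FF


def predict(a, b):
    """Predictor: expected output characteristic computed from the inputs."""
    return ((a % MODULUS) + (b % MODULUS)) % MODULUS


def checker(output, predicted):
    """Checker: return True (error) when the output characteristic differs."""
    return (output % MODULUS) != predicted
```

A single-bit flip changes the result by a power of two, which is never a multiple of 3, so every single-bit error at the adder output raises the error flag in this sketch.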

In contrast with other proposed solutions based on checker circuits, when fingerprinting is applied the random checker does not provide full fault detection (figure 2-29). In this case, the random checker performs some of the functions of the main circuit on only a small set of possible inputs, and is therefore able to detect errors at the output statistically, with a given probability. The main goal of this approach is to provide an acceptable level of fault detection using a circuit that is significantly smaller than the main circuit under inspection, thereby providing low area overhead. The underlying concept is generic and can be adopted for several different applications or circuits; the subset of inputs, the operations performed by the checker, and the performance, area, and power overheads vary according to the application.

[Figure 2-29 block diagram: the inputs feed the main circuit, which produces the output, and the random checker, which produces the error signal.]

Figure 2-29. Random checker technique
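The statistical nature of this detection can be made concrete with a simple probability model. The model is an assumption added for illustration: the checker is taken to cover a fixed fraction of the input space, and erroneous outputs are taken to occur on independently drawn inputs. Neither assumption comes from [6].

```python
def detection_probability(coverage, erroneous_outputs):
    """Probability that at least one of `erroneous_outputs` erroneous results
    falls on an input checked by the random checker, for a checker that
    covers the fraction `coverage` of the inputs (independence assumed)."""
    return 1.0 - (1.0 - coverage) ** erroneous_outputs


print(detection_probability(0.25, 1))  # a single transient error
print(detection_probability(0.25, 2))  # an error that persists for two results
```

Under this model a transient error that corrupts a single output is caught with probability equal to the coverage, while an error that persists over several outputs is increasingly likely to be caught.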

Typically, embedded systems in safety-critical applications use watchdog schemes, which under worst-case conditions detect an erroneous behavior only after a long series of clock cycles. Subsequently, the error is repaired by interrupt and retry. For applications that are also time-critical, such methods are too slow. If a rollback is done several clock cycles after error detection, the system may have had time to write large amounts of wrong data back into the system memory before compensation.

Micro rollback tries to recognize an error condition very early and provides error correction within a few clock cycles. A micro rollback scheme based on two separate processors was proposed in [27], in which the backup processor (trailer) is delayed by one or two clock cycles but performs exactly the same operations as the master processor. Micro rollback restores the last error-free processor state by reloading all register contents and re-executes the erroneous instruction on the master processor. In this scheme, the trailer processor holds register contents long enough to re-establish the original status. Such an approach is based on the implicit assumption that the trailer processor is self-checking in order to identify the faulty device. The trailer also acts as a backup in case of transient faults. Figure 2-30 exemplifies the micro rollback, as presented in [30].

Figure 2-30. Micro-rollback Example

2.4 Area and Performance Tradeoffs Summary

Each technique discussed previously presents a different area and performance overhead. It is possible to choose the most efficient one or to combine them to achieve the fault tolerance requirement for each type of application and system platform. Table 2-1 shows the area and performance of each presented technique when implemented in an 8-bit adder design with registered inputs and output, which contains 294 transistors to implement the combinational logic and 384 transistors to implement the 32 master-slave flip-flops. The area results are computed for a 90 nm technology (PTM model). The performance is shown as a general equation based on the setup and hold times and propagation delay of the flip-flops (tpffp), the propagation delay of the adder (tplogic) and the propagation delays of the added blocks such as comparators (tpcomparator), voters (tpvoters), encoding (tpenc) and decoding (tpdec) blocks and others.

Table 2-1. Comparison of Area and Performance in an 8-bit adder case-study circuit with registered inputs and outputs protected by SEE mitigation techniques.

Technique: Unprotected circuit
Area: Combinational logic: 294 transistors; Sequential logic: 384 transistors; Area = 584.24 µm²
Performance: Delay = tpffp + tplogic + tpffp
Fault Tolerance Capability: None, only inherent masking

Technique: Entire circuit protected by Duplication with Comparison (DWC)
Area: 2x Combinational logic: 588 transistors; 2x Sequential logic: 768 transistors; Comparator for the 16-bit output: 156 transistors; Area = 1,300.32 µm² (+122%)
Performance: Delay = tpffp + tplogic + tpffp + tpcomparator
Fault Tolerance Capability: SEU and SET detection

Technique: Triple Modular Redundancy (TMR) in the entire circuit with single voter at the output
Area: 3x Combinational logic: 882 transistors; 3x Sequential logic: 1152 transistors; Majority voters for the 16-bit output: 288 transistors; Area = 1,996.92 µm² (+241%)
Performance: Delay = tpffp + tplogic + tpffp + tpvoter
Fault Tolerance Capability: SEU and SET correction, but the final voter can be upset

Technique: TMR in the entire circuit with triple voter at the output
Area: 3x Combinational logic: 882 transistors; 3x Sequential logic: 1152 transistors; 3x Majority voters for the 16-bit output: 864 transistors; Area = 2,492.28 µm² (+326%)
Performance: Delay = tpffp + tplogic + tpffp + tpvoter
Fault Tolerance Capability: SEU and SET correction

Technique: Time redundancy in the output of the combinational logic with TMR in the registers
Area: 1x Combinational logic: 294 transistors; 3x Sequential logic: 1152 transistors; 2x Majority voters for the 8-bit inputs: 288 transistors; 1x Majority voters for the 16-bit output: 288 transistors; delay element (δ) taken as 16 transistors (chains of inverters); Area = 1,752.68 µm² (+199%)
Performance: Delay = tpffp + tplogic + δ + δ + tpffp + tpvoter
Fault Tolerance Capability: SEU and SET can be corrected; the added delay (δ) must be chosen according to the SET pulse width

Technique: Built-in Current Sensors in the combinational and sequential logic
Area: Combinational logic: 294 transistors; Sequential logic: 384 transistors; 33 Bulk-BICS: 1 for each 21 transistors; Area = 789.79 µm² (+35%)
Performance: Delay = tpffp + tplogic + tpffp
Fault Tolerance Capability: SEU and SET detection

Technique: Hardened memory cells in the registers
Area: Combinational logic: 294 transistors; 2x Sequential logic: 768 transistors; Area = 913.32 µm² (+56%)
Performance: Delay = tpffp* + tplogic + tpffp*, where tpffp* is the delay of the hardened flip-flop
Fault Tolerance Capability: SEU correction only; no SET detection or correction

Technique: Error Correction Code such as Hamming code in the input and output of the registers
Area: Combinational logic: 294 transistors; Sequential logic: 384 + (4 + 4 + 5) parity bits x 12 transistors; 2x 8-bit encoding: 144 transistors; 2x 8-bit decoding: 288 transistors; 1x 16-bit encoding: 144 transistors; 1x 16-bit decoding: 288 transistors; Area = 1,460.28 µm² (+150%)
Performance: Delay = tpenc + tpffp + tpdec + tplogic + tpenc + tpffp + tpdec
Fault Tolerance Capability: SEU correction only; no SET detection or correction

Technique: Recomputing with Shifted or Swapped operands
Area: Combinational logic: 294 transistors; Sequential logic: 384 transistors; 2x 8-bit Multiplexers 2:1: 64 transistors; Comparator for the 16-bit output: 156 transistors; Area = 772.28 µm² (+32%)
Performance: Delay = 2 x (tpffp + tpmux + tplogic + tpffp) + tpcomp
Fault Tolerance Capability: SEU and SET detection
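The area overheads quoted in Table 2-1 can be reproduced from the listed areas. The short script below is only an arithmetic check; the dictionary keys are shorthand labels chosen for this sketch, and the area values are copied from the table.

```python
BASELINE_UM2 = 584.24  # area of the unprotected circuit in Table 2-1

AREAS_UM2 = {
    "DWC": 1300.32,
    "TMR, single voter": 1996.92,
    "TMR, triple voter": 2492.28,
    "Time redundancy + TMR registers": 1752.68,
    "Bulk-BICS": 789.79,
    "Hardened memory cells": 913.32,
    "Hamming ECC": 1460.28,
    "Recomputing (shifted/swapped operands)": 772.28,
}


def overhead_percent(area, baseline=BASELINE_UM2):
    """Area overhead relative to the unprotected circuit, in percent."""
    return 100.0 * (area - baseline) / baseline


for name, area in AREAS_UM2.items():
    print(f"{name}: +{overhead_percent(area):.1f}%")
```

The computed values agree with the percentages in the table to within one percentage point; the small differences come from rounding versus truncation.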

3. Radiation Effects on FPGAs

Field-Programmable Gate Arrays (FPGAs) are configurable integrated circuits based on a regular structure with high logic density, which can be customized by the end user to realize different designs. The FPGA architecture is based on an array of logic blocks and interconnections customizable by programmable switches.

Several different programming technologies are used to implement the programmable switches. Three types of programmable switch technology are currently in use:

SRAM, where the programmable switch is usually a pass transistor or multiplexer controlled by the state of an SRAM bit (SRAM-based FPGAs).

Antifuse, where an electrically programmable switch forms a low-resistance path between two metal layers (antifuse-based FPGAs).

EPROM, EEPROM or FLASH cell, where the switch is a floating-gate transistor that can be turned off by injecting charge onto the floating gate.

Customizations based on SRAM are volatile. This means that SRAM-based FPGAs can be reprogrammed as many times as necessary at the work site and that they lose their configuration when the memories are not connected to the power supply. Antifuse customizations are non-volatile, so they hold the configuration even when not connected to the power supply, and they can be programmed just once. Each FPGA has a particular architecture. Programmable logic companies such as Xilinx, Actel, Aeroflex (licensed for QuickLogic FPGAs), Atmel and Honeywell (licensed for Atmel FPGAs) offer radiation-tolerant FPGA families. Each company uses different mitigation techniques to better take into account the architecture characteristics.

3.1 Antifuse-based FPGAs

The Actel RTAX-S family is an example of antifuse-based FPGAs for space applications. It consists of a regular matrix of combinational cells (C-cells) and sequential cells (R-cells) surrounded by regular routing channels, as shown in figure 3-1. All the customizations of the routing and of the C-cells and R-cells are done by an antifuse element (programmable switch). Results from radiation ground testing have shown that the programmable switches, whether based on ONO (oxide-nitride-oxide) or MIM (metal-insulator-metal) technology, are tolerant to ionization and total dose effects [81]. Therefore, the customizable routing is not sensitive to SEU; only the flip-flops used to implement the user design's sequential logic are sensitive to SEU.

Figure 3-1. ACTEL: RTAX-S device

The R-cell is composed of a Triple Modular Redundancy (TMR) flip-flop, or DFF, with a wired-or voter at the output, as presented in figure 3-2. This makes the R-cell robust to SEUs. However, at high operating frequencies, SETs can be observed [11]. Because of the number of transistors contained in an R-cell, there are several points susceptible to Single Event Transients (SETs).

As discussed previously, some of these SETs may be propagated through the

logic and captured by the R-cell, where all the 3 DFFs share the same data, clock, enable,

and reset lines. Due to this fact, a glitch appearing on one of these lines during a clock

edge will most likely appear as the same value to all of the DFFs and will not be correctly

mitigated. As the system clock frequency is increased, so is the probability of capturing

the SET. As the number of levels of combinatorial logic between each DFF increases, the

probability of generating a SET increases. The user may protect the C-cells by using high

level mitigation techniques in the description of the design (TMR, duplication and

others).

Figure 3-2. ACTEL: RTAX-S device

In [11], radiation tests performed on an Actel RTAX-S device showed the influence of the operating frequency on the error cross-section. The case-study architectures (shift registers) are illustrated in figure 3-3. The number of logic levels between two flip-flops was chosen from 0, 4, and 8 inverter gates.

Figure 3-3. Shift registers implemented at ACTEL: RTAX-S device for radiation

ground testing [11]

At each LET, several tests were performed at various frequencies on all of the shift register string types [11]. As the frequency increased, the error cross-section increased, as seen in figure 3-4. This is due to the probability of SET propagation and capture. A shift register string containing hardened (TMR) DFFs with combinatorial logic between these hardened flip-flops should present errors only when SETs in the combinational logic are captured by the TMR DFFs. The higher the frequency, the higher the probability of capturing the SET.

Figure 3-4. ACTEL: RTAX-S device test when using the shift register with 8

inverters between flip-flops [11]
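The frequency dependence can be reasoned about with a first-order latching-window model: a transient is captured only if it overlaps the flip-flop sampling window around a clock edge, so the capture probability scales with the ratio of the pulse width (plus setup and hold time) to the clock period. This model and the numbers below are assumptions added for illustration; they are not taken from the measurements in [11].

```python
def capture_probability(set_width_s, clock_hz, setup_hold_s=0.0):
    """First-order probability that a SET arriving at a flip-flop input at a
    random time overlaps the sampling window of a clock edge."""
    window = set_width_s + setup_hold_s
    return min(1.0, window * clock_hz)


# Hypothetical 0.5 ns transient observed at increasing clock frequencies.
for f_mhz in (15, 37.5, 75, 150):
    print(f_mhz, capture_probability(0.5e-9, f_mhz * 1e6))
```

In this model the capture probability grows linearly with the clock frequency until it saturates, which is consistent with the trend of an error cross-section that increases with frequency.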

The RadHard Eclipse FPGA is another example of an antifuse-based FPGA. It is provided by Aeroflex, which uses QuickLogic Corporation’s licensed ESP (Embedded Standard Products) technology. Its architecture is also composed of a regular matrix of configurable logic cells, each containing logic and flip-flops, surrounded by a regular routing matrix, as illustrated in figure 3-5. All the customizations are done by a programmable switch called a ViaLink connector. The device is fabricated in a 0.25 µm five-layer-metal ViaLink CMOS process.

The CLB flip-flops are SEU-hardened, which makes the CLB robust to SEU as well. However, the CLB logic can be susceptible to SETs that propagate through the logic and are captured by one of the flip-flops. High-level fault-tolerant techniques can also be implemented to mitigate SETs in designs synthesized into this type of FPGA.

Figure 3-5. RadHard Eclipse FPGA from Aeroflex

3.2 SRAM-based FPGAs

SRAM-based FPGAs are very attractive due to their high density, high performance, low NRE (Non-Recurring Engineering) cost, fast turnaround time and reconfigurability. For space and remote applications, SRAM-based FPGAs can offer additional benefits by allowing in-orbit design changes thanks to reconfigurability, which can reduce the mission cost by correcting errors or improving system performance after launch. In addition, the same circuitry can be used with different configurations at different stages of a mission, reducing weight and power requirements. Also, if part of an FPGA fails, the circuitry can be reprogrammed to make use of the remaining functional portions of the chip.

Xilinx FPGAs have an array of configurable logic blocks (CLBs) surrounded by programmable input/output blocks (IOBs), all interconnected by a hierarchy of fast and versatile routing resources. Each CLB has a set of look-up tables (LUTs), multiplexers and flip-flops, which are divided into slices. A LUT is a logic structure able to implement a Boolean function as a truth table. The CLBs provide the functional elements for constructing logic, while the IOBs provide the interface between the package pins and the CLBs. The CLBs are interconnected through a general routing matrix (GRM) that comprises an array of routing switches located at the intersections of horizontal and vertical routing channels. The FPGA matrix also has dedicated memory blocks called Block SelectRAMs, clock DLLs for clock-distribution delay compensation and clock domain control, and other components that vary according to the FPGA family.

Virtex devices are quickly programmed by loading a configuration bitstream (collection of configuration bits) into the device. The device functionality can be changed at any time by loading a new bitstream. The bitstream is divided into frames and contains all the information needed to configure the programmable storage elements in the matrix: the look-up tables (LUTs) and flip-flops, the CLB configuration cells and the interconnections.

Figure 3-6 shows a general Xilinx FPGA architecture, where each matrix tile is a configurable logic block (CLB) with the logic slices and the general routing matrix (GRM). The characteristics of the CLB logic and slices may vary with the FPGA family.

Due to the evolution of process technology, FPGAs are in the nanometer technology era. As shown in figure 3-7, the latest families, Virtex4 and Virtex5, are fabricated in 90 nm and 65 nm, respectively [86]. This evolution has allowed high logic integration. Nowadays it is possible to implement millions of gates and data memory in a single FPGA. In addition, there are families that include embedded hard microprocessor cores, such as the VirtexII-Pro family with a PowerPC connected to the customizable array. The CLBs and interconnection structures have also evolved in the past decade (figure 3-8).

Figure 3-6. Example of SRAM-based FPGA architecture based on regular array

Figure 3-7. Evolution of Xilinx FPGA families in the last decade

Figure 3-8. CLB logic evolution in the last decade

CLBs used to contain a small number of 4-input LUTs, where each LUT can implement any 4-input Boolean logic function, as for example in the Virtex family. Nowadays a CLB can contain a larger number of 4-input LUTs, as in the Virtex4 family, or even 6-input LUTs, where each LUT can implement any 6-input Boolean logic function, as in the most recently released Virtex5 family. The interconnection structures located in the GRM have also improved in the last decade, reducing the delay and increasing the performance of the implemented designs.

All this evolution has increased the interest in using SRAM-based FPGAs for a wide range of applications, but at the same time it has made it necessary to analyze carefully the soft error susceptibility of these highly complex structures. The effect of soft errors in the FPGA architecture on the implemented designs must be evaluated in order to implement efficient fault-tolerant techniques.

In FPGAs, a soft error has a peculiar effect in the user logic design since the

combinational and sequential logics are mapped into the programmable architecture.

Remember that in an ASIC, the effect of a soft error either in the combinational or in the

sequential logic is transient; the only variation is the time duration of the fault. A fault in

the combinational logic creates a transient logic pulse (SET) in a node that can propagate

through the logic according to the logic delay and topology. In other words, this means

that a SET in the combinational logic may or may not be latched by a flip-flop placed at

the combinational logic output. Faults in the sequential logic (SEU) manifest themselves

as bit flips, which will remain in the flip-flop until the next input load.

On the other hand, in an SRAM-based FPGA, both the user’s combinational and sequential logic are implemented by customizable logic memory cells, in other words, SRAM cells, as represented in figure 3-9. SEUs can occur in all SRAM cells, for example in the ones that configure the LUTs, control the CLB configurations, control the routing (GRM) and others.

When an SEU occurs in a memory cell that configures the LUT, it flips one of the stored values, modifying the implemented combinational logic. This fault has a permanent effect on the user logic, and it can only be corrected at the next load of the configuration bitstream, when the LUT is configured again with the original Boolean function defined by the user.

When a SEU occurs in a memory cell that controls the CLB configurations, as

shown in figure 3-9, the multiplexer controlled by the affect memory cell changes its

connection, and the original connection is undone. This also has a permanent effect, which can be mapped to an open or a short circuit in the user combinational logic implemented by the FPGA. The fault will also be corrected at the next load of the configuration bitstream, when the original configuration is loaded into the CLB control memory cells.

[Figure: user design logic mapped onto an FPGA CLB slice. Recoverable labels: LUT with inputs A, B, C, D; SRAM configuration cells; clk; E1, E2, E3; one configuration bit marked "upset".]

Figure 3-9. SEU Sensitive Bits in the CLB Slice

When a SEU occurs in a memory cell that controls the routing (GRM), as shown in figure 3-10, it may affect the multiplexer or the pass transistors responsible for making the connections between the logic. The SEU can result in open or short circuits in the logic. This also has a permanent effect, which can be mapped to an open or a short circuit in the user combinational logic implemented by the FPGA. The fault will also be corrected

at the next load of the configuration bitstream, when the original configuration is loaded into the memory cells that control the routing.

When an upset occurs in the CLB flip-flop or in the embedded memory, it has a

transient effect, because at the next load of the flip-flop or at the new data storage in the

memory, the bit-flip can be corrected. All of these effects are discussed in more detail in [34].

Figure 3-10. Examples of upsets in the SRAM-based FPGA architecture in the general routing matrix (GRM)

Radiation tests performed on Xilinx FPGAs, presented in [10], [62, 63], [78] and [80], show the effects of SEUs on the design application and confirm the need for fault-tolerant techniques in space applications. A fault-tolerant system designed into SRAM-based FPGAs must be able to cope with the peculiarities mentioned in this

section, such as the transient and permanent effects of a SEU in the combinational logic, short and open circuits in the design connections, and bit flips in the flip-flops and memory cells.

Results presented in [62, 63] show multiple-bit upsets (MBUs) in Virtex SRAM-based FPGAs. These results are very relevant because they determine the probability that MBUs will overcome the mitigation techniques applied in these devices. The results show that MBU events are less common in the Virtex family; most Virtex resources have about 10% of the MBU events seen in Virtex-II and Virtex-4. The only resource that does not follow this pattern in all three families is the BRAM, because of its high density. Figures 3-11(a) and 3-11(b) show the normalized percentage of MBU events by resource [62, 63].

The normalized percentages are determined by the ratio of the number of MBU events to all events for the resource. A comparison of the normalized values indicates that IOBs are very sensitive to MBUs. For the Virtex-II and Virtex-II Pro families, IOBs are nearly as sensitive as CLBs to MBUs. Events of five bits and larger were observed in Virtex-4. In summary, [62, 63] show that, due to technology scaling, MBUs are 27–33 times more common in the Virtex-II and Virtex-II Pro families than in the earlier Virtex family. MBU events are nearly three times more likely in the Virtex-4 family (fabricated in 90 nm process technology) than in the Virtex-II and Virtex-II Pro families (fabricated in 130 nm process technology) and 69 times more likely than in the Virtex family (fabricated in 220 nm process technology).
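As a small worked example of the normalization described above, the sketch below computes the percentage of MBU events among all events for each resource. The event counts are invented purely for illustration and are not measurements from [62, 63].

```python
def normalized_mbu_percentage(mbu_events, all_events):
    """Percentage of MBU events among all upset events for one resource."""
    if all_events <= 0:
        raise ValueError("all_events must be positive")
    return 100.0 * mbu_events / all_events

# Hypothetical (MBU events, all events) per resource; NOT data from the cited papers.
counts = {"CLB": (12, 1000), "IOB": (3, 200), "BRAM": (40, 500)}

percentages = {
    resource: normalized_mbu_percentage(mbu, total)
    for resource, (mbu, total) in counts.items()
}

for resource, pct in percentages.items():
    print(f"{resource}: {pct:.1f}% of events are MBUs")
```

Normalizing by the total events of each resource makes resources with very different bit counts comparable, which is what allows the IOB and CLB sensitivities to be compared directly.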

(a) Virtex family, in 0.22 µm process technology [63]

(b) Virtex-II family, in 0.13 µm process technology [63]

Figure 3-11. Percentage of MBU events in all events induced by heavy ion radiation for each resource in the Xilinx FPGAs

Concerning single event effects, a subset of the results presented in [28] is shown in figure 3-12. The plot in figure 3-12(a) shows the upset sensitivity for any physical

bit in the configuration bitstream, largely dominated by the configuration logic blocks

(CLBs). The data for the configuration logic blocks (CLBs) and BlockRAM (BRAM) are

shown separately. The Virtex-4 data look very much like that of the Virtex-II Pro. The

Block RAM cells (open symbols) have a small but consistently higher susceptibility than

the CLBs (filled symbols) in the knee region of the curve on a per-bit basis.

In addition to single event upsets (SEUs), complex devices like the Virtex-4 are

susceptible to single-event-functional-interrupt (SEFI) modes. These are upsets to a control circuit that disable large portions of the device's function. From studies of prior

Virtex FPGA generations we might expect to see SEFI modes involving the power-on-

reset circuit (POR), failures of the JTAG or SelectMap communication ports, or others. A

possible configuration clock (CCLK) upset observed in the Virtex-II Pro device was the

only Virtex SEFI mode yet seen that required a power-cycle to recover. All other modes

could be recovered by simply reloading the configuration. At this writing, we have

studied only the POR SEFI and looked for modes requiring a power cycle. SEFI results

of the POR are shown in figure 3-12(b) from [28].

(a) BRAM and CLB sensitive parts: Virtex-II versus Virtex-4

(b) SEFI sensitivity for the Power-On-Reset (POR)

Figure 3-12. Virtex-4 static SEU cross sections for three device types [28]

Note that there is also the possibility of having single event transient (SET) in the

combinational logic used to build the CLB such as input and output multiplexers used to

control part of the routing and the LUTs, as shown in figure 3-13.

Evaluating SET propagation in a design implemented in a FPGA relies on analyzing how the SET propagates from a LUT through a chain of pass transistors and multiplexers along the routing until it reaches a CLB flip-flop. The sensitivity of each node must be evaluated according to its capacitance and logic connection.

Figure 3-13. SET propagation in SRAM-based FPGA

Figure 3-14 shows a LUT structure based on pass transistors and a multiplexer also based on a pass-transistor tree. Both structures have a valid path that is defined by the inputs at a given time. The nodes sensitive to SETs are the drains of all transistors that are in the off state. For the selected paths in figure 3-14 and the stored values, there are few sensitive points, as indicated. In the case of the LUT that is propagating a ‘0’ value, only SETs that charge the node need to be analyzed. These are the ones generated by ionization in the drain of the off-state PMOS transistors placed on the same selected path.

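[Figure: pass-transistor LUT with inputs A, B, C, D and its SRAM LUT cells, followed by a pass-transistor routing multiplexer controlled by SRAM routing cells; the selected paths propagate ‘0’ and ‘1’ values.]

Figure 3-14. SET propagation in the internal LUT and routing multiplexers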

For the multiplexer that is propagating a ‘1’ value, only SETs that discharge the node need to be analyzed. These are the ones generated by ionization in the drain of the off-state NMOS transistors placed on the selected path.

4. Radiation Hardening by Design: Strategies for SRAM-based FPGAs

Designers can protect the design at the high-level description (VHDL or Verilog) level by using some form of redundancy that targets the FPGA architecture. The most popular high-level SEU mitigation technique currently used to protect designs synthesized in SRAM-based FPGAs is TMR combined with scrubbing. Xilinx has released a tool called X-TMR that automatically implements TMR in the user description, but users can also implement TMR in their designs themselves. However, due to the high area overhead of TMR, some alternative solutions have been proposed in recent years, so the user has the flexibility to implement duplication and self-checking techniques instead of TMR. These techniques may compromise fault tolerance to some extent, but the final result may be acceptable for a set of applications.

In this way, it is possible to use a commercial FPGA part to implement the design

and the soft error mitigation technique is applied to the design description before being

synthesized in the FPGA. The user has the flexibility of choosing the fault-tolerant

technique and consequently the overheads in terms of area, performance and power

dissipation. Figure 4-1 exemplifies the design flow of a general circuit implemented in a

FPGA.

One very important step of the design flow is the validation of the fault tolerance

technique that is usually done by fault injection. The original bitstream configured into

the FPGA can be modified by a circuit or a tool in the computer by flipping the bits of the bitstream, one at a time. This flip emulates a SEU in the configuration memory

cells. The output of the design under test (DUT) can be constantly monitored to analyze

the effect of the injected fault into the design. If an error is detected, this means that the

fault tolerant technique implemented is not robust for that specific fault (SEU) in that

target configuration memory bit.
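The fault injection loop described above can be sketched on a toy model. The sketch assumes that a "bitstream" is just a list of bits and that the DUT is a Python function of that list; the 2-input XOR user function, the bit layout, and all names are hypothetical. It compares an unprotected LUT with a triplicated LUT followed by a single (non-triplicated) voter LUT.

```python
from itertools import product

MAJ_LUT = [0, 0, 0, 1, 0, 1, 1, 1]  # majority of three bits, indexed by (r0, r1, r2)
XOR_LUT = [0, 1, 1, 0]              # hypothetical user function F = A xor B

def dut_plain(bits, a, b):
    """Unprotected design: one 2-input LUT (4 configuration bits)."""
    return bits[(a << 1) | b]

def dut_tmr(bits, a, b):
    """Three LUT copies (bits 0..11) followed by one voter LUT (bits 12..19)."""
    idx = (a << 1) | b
    r0, r1, r2 = bits[idx], bits[4 + idx], bits[8 + idx]
    voter = bits[12:20]
    return voter[(r0 << 2) | (r1 << 1) | r2]

def inject_all(dut, golden_bits):
    """Flip each configuration bit, one at a time, and compare with the golden outputs."""
    vectors = list(product((0, 1), repeat=2))
    expected = [dut(golden_bits, a, b) for a, b in vectors]
    sensitive = []
    for pos in range(len(golden_bits)):
        faulty = list(golden_bits)
        faulty[pos] ^= 1  # emulate one SEU in the configuration memory
        observed = [dut(faulty, a, b) for a, b in vectors]
        if observed != expected:
            sensitive.append(pos)  # the design is not robust to this bit flip
    return sensitive

plain_sensitive = inject_all(dut_plain, XOR_LUT)
tmr_sensitive = inject_all(dut_tmr, XOR_LUT * 3 + MAJ_LUT)
print("unprotected sensitive bits:", plain_sensitive)
print("TMR (single voter) sensitive bits:", tmr_sensitive)
```

In this toy model every bit of the unprotected LUT is sensitive, while the triplicated version is sensitive only in the voter entries reached during fault-free operation, which points to the voter as the remaining single point of failure.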

It is possible to inject faults in all the configuration bits and to analyze the most

critical parts of the design [67]. This can help guide designers in the early stages of the development process to choose the most appropriate fault-tolerant design, even before

any radiation ground testing. The entire fault injection campaign can take from a few hours to days, depending on the number of bits to be flipped and the connection to the fault injection control circuit. When the entire system (fault injection control + DUT + golden designs) is implemented at the hardware level (board), avoiding communication with the computer, the process is sped up by orders of magnitude.

Figure 4-1. FPGA mitigation design flow by editing the design hardware description language and the fault injection approach used to validate the design.

As discussed in [37], configuration logic blocks (CLBs), which are composed of lookup tables (LUTs) for logic generation, storage elements, multiplexers, and carry logic, together with the customizable routing, account for by far the largest number of configurable bits in each device. However, FPGA devices contain other important functional blocks that can also be upset by radiation, and once this occurs the effects can be catastrophic. Consequently, the susceptibility of these functional blocks must also be analyzed and mitigation techniques must be applied. Examples are: Digital Clock Managers (DCMs), which provide phase-locked, skew-corrected clock signals to all parts of the chip; Phase-Matched Clock Dividers (PMCDs), which offer additional frequency division options; the configuration controller circuit; the power-on reset (POR) circuitry; Input/Output Blocks (IOBs), which implement 28 common single-ended or differential (in pairs) I/O standards with digitally controlled impedance; XtremeDSP (DSP48) slices, each containing a dedicated 18x18-bit multiplier, adder, and 48-bit accumulator; and other specialized blocks. Table 4-1 presents a summary of SEE issues and possible SEU mitigation solutions that have been presented in [37].

Table 4-1. Representative Xilinx Virtex Family Potential Types of Device SEE Sensitivity, from [37]

FPGA component | SEE issues | Possible SEU mitigations

Configuration Memory | Single and multiple bit errors corrupting circuit operation, causing bus conflicts (current creep), etc. | Scrubbing; partial reconfiguration

Configuration Controller | Improper device configuration can occur if hit during configuration/reconfiguration | Partitioned design; multiple chip voting (redundancy by using multiple devices)

CLB | Logic hits and propagated upsets caused by transients | Triple modular redundancy (TMR); acceptable error rates

BRAM | Memory upsets in user area | TMR; Error Detection and Correction (EDAC) scrubbing

Half-latches | Sensitive structure used in configuration/routing | Removal of half-latches from design

POR | SEUs on POR can cause inadvertent reboot of device | Multiple chip voting (redundancy by using multiple devices)

IOB | SEUs can cause false outputs to other devices or inputs to logic | Leverage immune config. memory cell; evaluate input SET propagation

DCM | Can cause clock errors that spread across clock cycles | TMR; temporal TMR

DSP | Hard IP that is unhardened and can cause single event functional interrupts (SEFIs) or data errors | TMR; temporal TMR

MGT | Gigabit transceivers. Hits in logic can cause bursts or SEFIs; otherwise, bit errors in data stream | TMR; protocol re-writes

PPC | Hard IP that is unhardened. SEFIs are prime concern | TMR or software task redundancy

SEL | Higher current condition that is potentially damaging | No mitigation other than substrate addition (epi). Circumvention techniques possible

4.1 Scrubbing

It is important to note that hardware redundancy by itself is not sufficient to avoid errors in the FPGA; it is mandatory to reload the bitstream constantly to avoid the accumulation of faults. This continuous reloading of the bitstream is called scrubbing. Scrubbing, as explained in Xilinx Application Notes 138 and 151, allows a system to repair bit-flips in the configuration memory without disrupting its operation. This includes the memory cells that configure the LUTs, the ones that control the routing (GRM), and the ones that control the CLB customization. Configuration scrubbing prevents the build-up of multiple configuration faults and reduces the time in which an invalid circuit configuration is allowed to operate. Scrubbing does not refresh the contents of CLB flip-flops or of the embedded memories (the Block SelectRAMs). Scrubbing is performed through the Virtex SelectMAP interface. Furthermore, systems must employ configuration scrubbing with redundancy-based mitigation techniques such as TMR before any reliability enhancement is observed. Without scrubbing, the build-up of multiple faults would eventually break the redundancy.

It is recommended to scrub at least 10X faster than the worst-case SEU rate. When the FPGA is in this mode, an external oscillator generates the configuration clock that drives the FPGA and the PROM that contains the “gold” bitstream. At each clock cycle, new data are available on the PROM data pins. The frequency at which scrubbing must be performed depends on the particle flux and on the cross-section of the device.
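The dependence of the scrub rate on flux and cross-section can be sketched with a first-order estimate. The sketch assumes that the upset rate is the product of particle flux, per-bit cross-section, and number of configuration bits, which is a simplification introduced here for illustration; all numeric values are hypothetical and do not describe any real device or environment.

```python
def upset_rate_per_second(flux, cross_section_per_bit, n_bits):
    """First-order upset rate estimate (assumption: rate = flux * sigma * bits).

    flux                  -- particles / (cm^2 * s)
    cross_section_per_bit -- cm^2 / bit
    n_bits                -- number of configuration bits
    """
    return flux * cross_section_per_bit * n_bits

def max_scrub_period(upset_rate, margin=10.0):
    """Longest scrub period (s) that still scrubs `margin` times faster than the upset rate."""
    if upset_rate <= 0:
        raise ValueError("upset_rate must be positive")
    return 1.0 / (margin * upset_rate)

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
flux = 10.0      # particles / (cm^2 * s)
sigma = 1e-8     # cm^2 / bit
bits = 1_000_000

rate = upset_rate_per_second(flux, sigma, bits)  # about 0.1 upsets per second
period = max_scrub_period(rate)                  # about 1 s between full scrubs
print(f"upset rate ~ {rate:.3g} /s, scrub at least every {period:.3g} s")
```

The `margin` parameter encodes the 10X recommendation quoted above; a real analysis would use the measured cross-section curve and the mission environment rather than single numbers.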

For system-on-chip (SoC) platforms, the Hardware Internal Configuration Access Port (HWICAP) module can also be used to reconfigure parts of the configuration matrix from inside the FPGA, under the control of the embedded processor (hard-core PowerPC or soft-core MicroBlaze). The ICAP is able to load partial bitstreams and configure them without interrupting the application. It implements a subset of the SelectMAP interface and it

generates no noise during reconfiguration. The ICAP module is connected to the

embedded processor by the available local bus OPB and the EDK tool can be used for

that task.

4.2 Triple Modular Redundancy

Triple Modular Redundancy (TMR) is a well-known fault-tolerant technique for avoiding errors in integrated circuits. The TMR scheme uses three identical logic blocks performing the same task in tandem, with corresponding outputs being compared through majority voters (MAJ). Since all the customizable memory cells are sensitive to soft errors, single points of failure must be avoided inside the FPGA. Consequently, all the inputs and outputs must also be triplicated. If an upset occurs in one of the IO cells or in the routing that connects the logic from/to the IO blocks, there are two other redundant inputs or outputs able to ensure the correct value. The voter is also triplicated because, if one fails, there are two voters able to maintain the correct value in two redundant logic parts.

This full TMR, also called X-TMR, is especially suitable for protecting designs synthesized in SRAM-based Field Programmable Gate Arrays (FPGAs), as proposed by [16]. Figure 4-2 shows the full TMR that must be applied to the user design logic before it is synthesized into the Xilinx FPGA. The CLB flip-flops (user sequential logic) are triplicated, with triple majority voters (MAJ) and a feedback connection that is able to restore the correct data in the flip-flops. This setup is important because, as seen above, scrubbing is not able to restore the correct value of a CLB flip-flop. The majority voter defines the correct output as two out of three input values, as defined in the truth table in figure 4-2. The very last output voter, which can be placed at the output of CLB flip-flops or combinational logic blocks, is different from the MAJ voter. Note that there are three output signals that go to the IO pads, each controlled by a tri-state buffer. A redundant logic part that holds an error should not pass the error to the output, so the tri-state buffer should block the faulty redundant part. The TMR output voter chooses the tri-state buffer control value based on one reference value (the input coming from one of the redundant logic parts) and the two inputs coming from the other redundant logic parts, as shown in the truth table in figure 4-2. Each voter can be implemented in a LUT.

[Figure: three redundant logic domains (tr0, tr1, tr2) inside the FPGA with separate clocks (clk0, clk1, clk2), triplicated input and output package pins, TMR flip-flops built from CLB flip-flops with majority voters (MAJ), and TMR output voters (O_voter) driving the tri-state controls 3-state_0, 3-state_1 and 3-state_2.]

Truth tables recovered from the figure (inputs R0 R1 R2; REF marks the reference input of the output voter):

R0 R1 R2 | MAJ | TMR output voter

0 0 0 | 0 | 0

0 0 1 | 0 | 0

0 1 0 | 0 | 0

0 1 1 | 1 | 1

1 0 0 | 0 | 1

1 0 1 | 1 | 0

1 1 0 | 1 | 0

1 1 1 | 1 | 0

LUT contents: MAJ = 00010111_00010111; TMR output voter = 00011000_00011000

Figure 4-2. TMR implemented in FPGA

Since all the outputs are connected together outside the FPGA device, this

connection works as an analog voter where the majority prevails, so, even if one output

voter fails, the output can manage to show the correct value.
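The two voters of figure 4-2 can be written as small Boolean functions whose truth tables match the LUT contents given in the figure (00010111 for MAJ and 00011000 for the output voter). This is a minimal sketch: the function names are hypothetical, and it assumes that a tri-state control value of 1 puts the corresponding output buffer in high impedance, which is consistent with the truth table but is not stated explicitly in the text.

```python
def majority(r0, r1, r2):
    """MAJ voter: output is the value held by at least two of the three inputs."""
    return (r0 & r1) | (r0 & r2) | (r1 & r2)

def output_voter(ref, other_a, other_b):
    """TMR output voter: 1 when the reference disagrees with both other domains.

    Assumption: a 1 disables (high-Z) the tri-state buffer of the reference domain.
    """
    return int(ref != other_a and ref != other_b)

def truth_table(fn):
    """LUT contents for a 3-input function, enumerated from 000 to 111."""
    return "".join(
        str(fn(a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1)
    )

maj_lut = truth_table(majority)      # "00010111"
out_lut = truth_table(output_voter)  # "00011000"

# Example: domain tr1 holds an upset value while tr0 and tr2 are correct.
tr = [1, 0, 1]
controls = [
    output_voter(tr[0], tr[1], tr[2]),
    output_voter(tr[1], tr[0], tr[2]),
    output_voter(tr[2], tr[0], tr[1]),
]
# Values actually driven onto the shared external net by the enabled buffers.
driven = {tr[i] for i in range(3) if controls[i] == 0}
print(maj_lut, out_lut, controls, driven)
```

With one faulty domain, only that domain's buffer is disabled, so the pads that still drive the external connection agree on the correct value.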

4.3 Duplication with Comparison with Concurrent Error Detection

The TMR technique is a suitable solution for FPGAs because it provides full hardware redundancy, including the user’s combinational and sequential logic, the routing, and the I/O pads. However, this full hardware redundancy comes with penalties in area, I/O pad usage, and power dissipation. Many applications can accept the limitations of the TMR approach, but some cannot. Aiming to reduce the pin-count overhead of a full hardware redundancy implementation (TMR) while still coping with permanent upset effects, a technique based on duplication with comparison (DWC) and concurrent error detection (CED) was proposed by [34] (figure 4-3). The CED must be able to identify the fault-free redundant

logic part. The CED should have a smaller area than the redundant logic block in order to

present a reduced area overhead compared to the X-TMR.

[Figure: two redundant logic domains (dr0, dr1) inside the FPGA between the input and output package pins; each logic stage is followed by a CED block ("logic able to detect which redundant logic (dr0 or dr1) is fault free") and by TMR flip-flops with majority voters (MAJ, LUT contents 00010111_00010111) clocked by clk0, clk1 and clk2.]

Figure 4-3. Duplication with Comparison and Concurrent Error Detection technique (DWC-CED) for SRAM-based FPGAs

The CED scheme can be based on time redundancy. In this way, it recomputes the

input operands in two different ways to detect permanent faults. During the first

computation at the first clock cycle, the operands are used directly in the combinational

block and the result is stored for further comparison. During the second computation at

the second clock cycle, the operands are modified, prior to use, in such a way that errors

resulting from permanent faults in the combinational logic are different in the first

calculation than in the second and can be detected when results are compared. These

modifications are seen as encoder and decoder processes and they depend on the

characteristics of the logic block. The general scheme is presented in figure 4-4.

Figure 4-4. General Time redundancy scheme for permanent fault detection

Figure 4-5 shows the scheme proposed for an arithmetic module, for instance a multiplier. There are two multiplier modules: mult_dr0 and mult_dr1. Multiplexers at the inputs provide either normal or shifted operands. The outputs computed from the normal operands are always stored in a sample register, one for each module. Each output goes directly to the input of the user’s TMR register. Module dr0 connects to register tr0 and module dr1 connects to register tr1. Register tr2 receives the output of the module that does not have any fault. By default, the circuit starts by passing module dr0. A comparator at the outputs of registers dr0 and dr1 indicates when the outputs mismatch (Hc). If Hc=0, no error is found and the circuit continues to operate normally. If Hc=1, an error has occurred and the operands need to be recomputed using the RESO (recomputing with shifted operands) method to detect which module is faulty. The detection takes one clock cycle. While the circuit performs the detection, the user’s TMR register holds its previous value. When the fault-free module is found, register tr2 receives the output of this module, and it continues to receive this output until the next chip reconfiguration (fault correction).
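The detection sequence above can be sketched in software. The sketch assumes a hypothetical permanent fault (output bit 2 of one multiplier stuck at 0) and a simple encoding in which one operand is shifted left by one bit and the result is shifted right by one bit; these choices, and all function names, are illustrative assumptions and not the exact encoder/decoder of [34].

```python
def mult_ok(a, b):
    """Fault-free multiplier module."""
    return a * b

def mult_stuck_bit2(a, b):
    """Multiplier with a hypothetical permanent fault: output bit 2 stuck at 0."""
    return (a * b) & ~0b100

def reso_self_check(module, a, b):
    """Tc flag: 0 if the normal and shifted computations agree, 1 otherwise."""
    normal = module(a, b)
    # Encode: shift one operand left by one bit; decode: shift the result right.
    shifted = module(a << 1, b) >> 1
    return int(normal != shifted)

def select_fault_free(dr0, dr1, a, b):
    """Return (Hc, value passed to register tr2); value is None if undecided."""
    out0, out1 = dr0(a, b), dr1(a, b)
    hc = int(out0 != out1)  # comparator between the two modules
    if hc == 0:
        return hc, out0     # no mismatch: dr0 is passed by default
    tc0 = reso_self_check(dr0, a, b)
    tc1 = reso_self_check(dr1, a, b)
    if tc0 == 0 and tc1 == 1:
        return hc, out0
    if tc1 == 0 and tc0 == 1:
        return hc, out1
    return hc, None         # the recomputation could not isolate the faulty module

# 3 * 2 = 6 (binary 110): the stuck bit corrupts the result of the faulty module.
print(select_fault_free(mult_stuck_bit2, mult_ok, 3, 2))
# 1 * 1 = 1: the fault is not excited, so no mismatch is flagged.
print(select_fault_free(mult_stuck_bit2, mult_ok, 1, 1))
```

The shifted recomputation moves the data to different bit positions, so the stuck bit corrupts a different part of the result and the faulty module disagrees with itself, while the fault-free module does not.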

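[Figure: modules dr0 and dr1 with operands A and B, input multiplexers (select signals ST0 and ST1) choosing between normal and encoded operands, encoders at the inputs and decoders at the outputs, comparators producing Tc0, Tc1 and Hc, and CED logic that selects the fault-free module.]

Figure 4-5. Case-study CED based on encoder and decoder for arithmetic logic blocks.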

4.4 Placement and Routing Issues

The problem with fault tolerance techniques based on redundancy and majority voters is that one must ensure that a SEU cannot affect more than one redundant domain of the design [40], [64]. If a SEU is able to affect two domains of a redundant design, the majority voter is not able to choose the correct result out of three, and errors can appear at the design output. The only way a single fault can affect more than one redundant domain is by upsetting the SRAM cells that control the routing connections.

Upsets in the routing represent the main concern, as 90% of the SRAM cells inside the FPGA are responsible for routing control. The main effects of an upset in the routing are open lines and short circuits between distinct lines, as discussed previously. The probability of SEUs in the routing upsetting more than one redundant domain depends on the logic placement and the number of majority voters in the design.

In figure 4-6, there are two examples of upsets in the routing. Upset “a” connects

two signals from the same redundant domain, which does not generate an error in the

TMR output, because the outermost voters will vote out the upset effect. However, upset “b” may provoke an error in the TMR output, because it connects two signals from distinct redundant logic blocks, affecting two out of three redundant domains of the TMR. In the next sections, three solutions are discussed to improve reliability in this respect.

Figure 4-6. SEU in the routing affecting two distinct redundant logic parts

4.4.1 Solutions based on Placement and Routing

Dedicated floorplanning for each redundant part of the TMR can reduce the probability of upsets in the routing affecting two or more logic modules, but it may not be sufficient, since placement can be too complex in some cases. Remember that each time voters are included, there are connections between the redundant parts, which makes it impossible to place the redundant logic parts very far away from each other with no connections at all (figure 4-7). One solution is the Reliability-Oriented Place and Route algorithm (RoRA) proposed by [76, 77], a place and route algorithm for SRAM-based FPGAs that applies particular techniques to harden circuits mapped on SRAM-based FPGAs against SEUs in their configuration memory cells. Routing duplication can also be a solution to improve reliability in TMR. In [33], a method is proposed to duplicate the routing locally inside the CLB to avoid problems with open and short circuits provoked by SEUs in the routing.

Figure 4-7. Majority voter placement in the TMR approach

4.4.2 Solutions based on Voting Adjustments

The first voting adjustment was proposed by [35], which partitions the logic in order to add more voter stages to the circuit. If the redundant logic parts tr0, tr1 and tr2 (represented in figure 4-6 after the TMR register with voters and refresh) are partitioned into smaller logic blocks with voters, a connection between signals from distinct redundant parts can be voted by different voters. This logic partition by voters is represented in figure 4-8. Notice that upset “b” can now no longer provoke an error in the TMR output, which increases the robustness of the TMR in the presence of routing upsets without requiring attention to floorplanning. The problem is to determine the logic block size that achieves the best robustness. If the logic is partitioned into very small blocks, the number of voters increases dramatically, making the TMR implementation overly costly. The objective is to find the best partition in terms of area cost, performance and robustness.

The results presented by [35] suggest that there is a trade-off between the logic partition of the throughput logic (and consequently the number of voters) and the number of routing upsets that could provoke an error in the TMR. Contrary to what was expected, a larger number of voters does not always mean greater protection against

upsets. There is an optimal logic partition for each circuit that can reduce the propagation

of the upset effect in the routing.

Figure 4-8. Triple Modular Redundancy (TMR) scheme with logic partition in the FPGA
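The effect of adding voter stages can be sketched with a two-stage toy pipeline. The sketch models a routing upset of type “b” as two bit inversions, one in domain 0 after the first stage and one in domain 1 after the second stage; this fault model, the inverter stages, and all names are simplifying assumptions made for illustration and do not reproduce the experiments of [35].

```python
def majority(r0, r1, r2):
    """Majority of three bits."""
    return (r0 & r1) | (r0 & r2) | (r1 & r2)

def invert(v):
    """Toy logic stage: a single inverter."""
    return v ^ 1

def run_tmr(x, stages, faults, vote_between_stages):
    """Evaluate a triplicated pipeline and return the final voted output.

    faults is a set of (stage_index, domain) pairs whose stage output is inverted.
    """
    values = [x, x, x]
    last = len(stages) - 1
    for s, stage in enumerate(stages):
        values = [
            stage(v) ^ (1 if (s, d) in faults else 0)
            for d, v in enumerate(values)
        ]
        if vote_between_stages and s != last:
            # In hardware each domain has its own voter; they compute the same value.
            voted = majority(*values)
            values = [voted, voted, voted]
    return majority(*values)

stages = [invert, invert]
golden = run_tmr(0, stages, set(), vote_between_stages=False)

# One routing upset modelled as corrupting domain 0 at stage 0 and domain 1 at stage 1.
upset_b = {(0, 0), (1, 1)}
unpartitioned = run_tmr(0, stages, upset_b, vote_between_stages=False)
partitioned = run_tmr(0, stages, upset_b, vote_between_stages=True)
print(golden, unpartitioned, partitioned)
```

Without intermediate voters the two corrupted domains outvote the correct one, whereas with a voter after the first stage each voter sees only one wrong input. The example does not capture the cost side of the trade-off, since the added voters and their routing are themselves configured by sensitive bits.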

4.5 Partial Triple Modular Redundancy

A partial TMR mitigation strategy was proposed in [52, 61]. It is based on the idea that some parts of the circuit are more critical than others and that not all logic blocks need to be protected by TMR. Some non-critical blocks can be left unprotected and corrected only by periodic scrubbing.

The approach separates sensitive configuration bits into two categories, called “persistent” and “non-persistent” [52, 61], as shown in Figure 4-9. A non-persistent configuration bit is a sensitive configuration bit whose upset causes a functional error in the design, but once the bit is repaired through configuration scrubbing, the design returns to normal operation and all previously induced functional errors eventually disappear. No additional intervention is required to restore normal functionality. A persistent configuration bit is a sensitive configuration bit whose upset also causes a functional error; however, after the upset bit is repaired through configuration scrubbing, the FPGA circuit does not return to normal operation. This is because the errors are stored in flip-flops within feedback loops, which cannot be corrected by scrubbing. In this case, a global reset is needed to return the circuit to a proper state. The global reset takes the circuit offline for the time needed to reset it and restart in normal operating mode.
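The distinction between the two categories can be illustrated with a small behavioral model. The sketch below is not taken from [52, 61]; it assumes a toy design with one feed-forward output and one 4-bit counter, and the names `GOLDEN`, `ff_invert`, `counter_step`, `step` and `errors_after_scrub` are hypothetical. An upset changes one configuration value, scrubbing later restores the configuration without touching the flip-flop contents, and the model reports which outputs still differ from a fault-free copy after the scrub.

```python
# Golden (fault-free) configuration of the toy design. Both entries stand in
# for configuration bits; the names are illustrative assumptions.
GOLDEN = {"ff_invert": 0, "counter_step": 1}


def step(config, state, x):
    """Advance one clock cycle and return (next_state, feed_forward_out)."""
    ff_out = x ^ config["ff_invert"]                      # feed-forward path, no stored state
    next_state = (state + config["counter_step"]) % 16   # feedback path: 4-bit counter
    return next_state, ff_out


def errors_after_scrub(upset_key, upset_value, upset_cycles=3, check_cycles=8):
    """Inject an upset, scrub after `upset_cycles`, and report which outputs
    still differ from the golden design during the following cycles."""
    faulty_cfg = dict(GOLDEN, **{upset_key: upset_value})
    gold_state = faulty_state = 0

    # Phase 1: the upset is present in the configuration memory.
    for cycle in range(upset_cycles):
        x = cycle & 1
        gold_state, _ = step(GOLDEN, gold_state, x)
        faulty_state, _ = step(faulty_cfg, faulty_state, x)

    # Scrubbing rewrites the configuration only; flip-flop contents are kept.
    faulty_cfg = dict(GOLDEN)

    # Phase 2: compare both copies after the configuration has been repaired.
    ff_wrong = counter_wrong = False
    for cycle in range(upset_cycles, upset_cycles + check_cycles):
        x = cycle & 1
        gold_state, gold_ff = step(GOLDEN, gold_state, x)
        faulty_state, faulty_ff = step(faulty_cfg, faulty_state, x)
        ff_wrong |= gold_ff != faulty_ff
        counter_wrong |= gold_state != faulty_state
    return {"feed_forward": ff_wrong, "counter": counter_wrong}


if __name__ == "__main__":
    print("upset in feed-forward bit:", errors_after_scrub("ff_invert", 1))
    print("upset in counter bit:     ", errors_after_scrub("counter_step", 2))
```

In this model the upset on `ff_invert` behaves as non-persistent: once scrubbed, no output differs from the golden copy. The upset on `counter_step` behaves as persistent: the counter keeps a wrong offset after scrubbing, and only reloading its state (a reset) would realign it.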

In [52, 61], it is proposed that feedback structures of the design should be

mitigated first because they are more critical. Any logic feeding into the feedback

structures should follow, since it contributes to the state of the design and thus to its

persistence. The feed-forward logic, which forms the non-persistent part of the circuit, does not

contribute to the persistence of a design and should be mitigated last. How much of the design is

mitigated depends on the application and the expected mean time between failures (MTBF).
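The priority order described above can be sketched as a graph classification. This is only an illustration under stated assumptions, not the tool flow of [52, 61]: the netlist is modeled as a dictionary mapping each block to the blocks it drives, and the function names and the example netlist are hypothetical. A block is treated as part of a feedback structure if it can reach itself, as feeding logic if it is outside any loop but drives one, and as feed-forward otherwise.

```python
def reachable(graph, start):
    """Return every node reachable from `start` by following at least one edge."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def mitigation_priority(graph):
    """Group netlist blocks by suggested mitigation order.

    `graph` maps each block name to the list of blocks it drives.
    """
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    reach = {n: reachable(graph, n) for n in nodes}

    # 1) Feedback structures: blocks that lie on a cycle.
    feedback = {n for n in nodes if n in reach[n]}
    # 2) Logic feeding the feedback structures: it influences the stored state.
    feeding = {n for n in nodes - feedback if reach[n] & feedback}
    # 3) Everything else is feed-forward and does not affect persistence.
    feed_forward = nodes - feedback - feeding

    return {
        "feedback": sorted(feedback),
        "feeds_feedback": sorted(feeding),
        "feed_forward": sorted(feed_forward),
    }


if __name__ == "__main__":
    # Hypothetical accumulator: in_a -> adder -> acc_reg -> adder (loop),
    # with acc_reg and in_b also driving purely combinational output logic.
    netlist = {
        "in_a": ["adder"],
        "adder": ["acc_reg"],
        "acc_reg": ["adder", "out_logic"],
        "in_b": ["out_logic"],
        "out_logic": [],
    }
    for group, blocks in mitigation_priority(netlist).items():
        print(group, blocks)
```

The all-pairs reachability used here is quadratic in the number of blocks, which is acceptable for a small illustration; a real netlist would more likely use a strongly-connected-components algorithm.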

Figure 4-9. Example of non-persistent and persistent upsets, as defined in [61].

5. Final Remarks

This manuscript has explored fault-tolerant techniques to protect integrated circuits against soft errors. A set of hardening-by-design solutions for application-specific integrated circuits (ASICs) and for field-programmable gate arrays (FPGAs) was presented and discussed.

The main challenge for ASICs is to have techniques that can cope with the new paradigm of nanometer technologies: transient pulses lasting longer than the circuit cycle time, which may affect one or more bits of the circuit output, and multiple faults. Together these make most of the currently known mitigation techniques obsolete. For memory cells, traditional solutions such as ECCs and hardened memory cells can still be applied, provided the probability of multiple upsets is taken into account.

The main challenge for FPGAs is to characterize the sensitivity of the user design to soft errors once it is implemented in the SRAM-based FPGA, and to define the most efficient redundancy that can be applied within limited area resources. The effect of soft errors on a user logic design synthesized in an SRAM-based FPGA was analyzed in detail. Triple Modular Redundancy (TMR) and Duplication with Comparison and Concurrent Error Detection (DWC-CED) techniques were presented. The issues surrounding the placement and routing of redundant blocks inside the FPGA were also discussed, and some solutions were proposed.

In summary, no hardening-by-design solution is fully effective for all types of circuits, applications and environments. It is important to characterize carefully the soft-error sensitivity of your target design and application, and then to choose a set of fault-tolerant solutions that works properly for your design. The ideal solution for a reliable system may combine solutions applied at different steps of the design process: layout constraints, transistor-level redundancy, logic-level solutions, recomputation and system-level approaches.

References

[1] Alexandrescu, D., Anghel, L., Nicolaidis, M., “New methods for evaluating the impact of single event transients in VDSM ICs”, in IEEE International Symposium on Defect And Fault Tolerance in VLSI Systems, DFT, 17., 2002. p. 99-107.

[2] Anghel, L., Nicolaidis, M., “Cost Reduction and Evaluation of a Temporary Faults Detecting Technique”, in Proc. DATE, IEEE Computer Society, 2000, p. 591-598.

[3] Anghel, L., Alexandrescu, D., Nicolaidis, M., “Evaluation of a soft error tolerance technique based on time and/or space redundancy”, in the Proceedings of Symposium on Integrated Circuits and Systems Design, SBCCI, 13., 2000. Proceedings… Los Alamitos : IEEE Computer Society, 2000. p. 237-242.

[4] Asadi, G., Tahoori, M., “An Accurate SER Estimation Method Based on Propagation Probability”, in Proceedings of Design, Automation and Test in Europe Conference, DATE, 2005.

[5] Athan, S., Landis, D., Al-Arian, S., “A Novel Built-in Current Sensor for IDDQ Testing of Deep Submicron CMOS ICs”, in Proceedings of 14th VLSI Test Symposium, 1996. pp 118-123.

[6] Austin, T. “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design”, in MICRO32 - Proceedings of the 32nd ACM/IEEE International Symposium on Microarchitecture, pages 196-207, Los Alamitos, CA, November, 1999.

[7] Barth, J., “Applying Computer Simulation Tools to Radiation Effects Problems”, in: IEEE Nuclear Space Radiation Effects Conference Short Course, NSREC, 1997.

[8] Baumann, R., “The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction”, Electron Devices Meeting, 2002. IEDM '02. Digest. International, Dec., 2002, p. 329-332.

[9] Baumann, R.; Smith, E., “Neutron-induced boron fission as a major source of soft errors in deep submicron SRAM devices”, in: Proceedings of IEEE International Reliability Physics Symposium, 38., IEEE Computer Society, 2000.

[10] Berg, M., “Fault Tolerance Implementation within SRAM Based FPGA Design Based upon the Increased Level of Single Event Upset Susceptibility”, in IEEE International On-line Test Symposium, IOLTS, 2006. pp. 89-91.

[11] Berg, M., Wang, J.J., Ladbury, R., Buchner, S., Kim, H., Howard, J., LaBel, K., Phan, A., Irwin, T., Friendlich, M., “An Analysis of Single Event Upset Dependencies on High Frequency and Architectural Implementations within Actel RTAX-S Family Field Programmable Gate Arrays”, IEEE Transactions On Nuclear Science, VOL. 53, NO. 6, Dec., 2006. p. 3569- 3574.

[12] Bessot, D.; Velazco, R., “Design of SEU-hardened CMOS memory cells: the HIT Cell”, in European Conference on Radiation and Its Effects on Components and Systems, RADECS, 2., 1993. p. 563-570.

[13] Buchner, S.; Campbell, A.; Meehan, T.; Clark, K.; Mcmorrow, D.; Dyer, C.; Sanderson, C.; Comber, C.; Kuboyama, S., “Investigation of Single-Ion Multiple-Bit Upsets in Memories on Board a Space Experiment”, IEEE Transactions on Nuclear Science, Vol. 47, Issue 3, pp. 705-711, June 2000.

[14] Calin, T., Nicolaidis, M., Velazco, R., “Upset hardened memory design for submicron CMOS technology”, IEEE Transactions on Nuclear Science, New York, v.43, n.6, p. 2874 -2878, Dec. 1996.

[15] Canaris, J.; Whitaker, S., “Circuit techniques for the radiation environment of space”, in the Proceedings of Custom Integrated Circuits Conference, 1995. p. 77-80.

[16] Carmichael, C., Triple Module Redundancy Design Techniques for Virtex® Series FPGA: Application Notes 197. San Jose, USA: Xilinx, 2000.

[17] Cazeaux, J., Rossi, D., Omaña, M., Metra, C., Chatterjee, A., “On Transistor Level Gate Sizing for Increased Robustness to Transient Faults”, in Proceedings of International On-line Test Symposium, IOLTS, 2005.

[18] Chen, C., Hsiao, M., “Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review,” IBM J. Res. Develop., Vol. 28, pp. 124-134, Mar. 1984.

[19] Crain, S. et al., “Analog and digital single-event effects experiments in space”, IEEE Transactions on Nuclear Science, New York, v.48, n.6, Dec. 2001.

[20] Dhillon, Y., Diril, A., Chatterjee, A., Singh, A., “Analysis and Optimization of Nanometer CMOS Circuits for Soft-Error Tolerance”, IEEE Transactions On Very Large Scale Integration (VLSI) Systems, VOL. 14, NO. 5, May 2006. pp. 514-524.

[21] Dodd, P. E., et al., “Production and propagation of Single-Event Transients in High-Speed Digital Logic ICs”, IEEE Transactions on Nuclear Science, Vol 51, No 6, Part 2, IEEE Computer Society, Los Alamitos, CA, December 2004, pp 3278-3284.

[22] Dodd, P. E., Massengill, L. W. “Basic Mechanism and Modeling of Single-Event Upset in Digital Microelectronics”, IEEE Transaction on Nuclear Science, vol. 50, June, 2003, pp. 583-602.

[23] Dodd, P., “Physics-Based Simulation of Single-Event Effects”, IEEE Transactions on Device and Materials Reliability, VOL. 5, NO. 3, Sept. 2005. pp.343-457.

[24] Dupont, E.; Nicolaidis, M.; Rohr, P., “Embedded robustness IPs for transient-error-free ICs”. IEEE Design & Test of Computers, New York, v.19, n.3, May-June 2002, p. 54-68.

[25] Ferlet-Cavrois, V. et al., “Statistical Analysis of the Charge Collected in SOI and Bulk Devices Under Heavy Ion and Proton Irradiation - Implications for Digital SETs”, IEEE Transactions on Nuclear Science, Vol 53, No 6, Part 1, IEEE Computer Society, Los Alamitos, CA, December 2006, pp 3242-3252.

[26] Gadlage, M. J., Schrimpf, R. D., Benedetto, J. M., Eaton, P. H., Mavis, D. G., Sibley, M., Avery, K., and Turflinger, T. L., “Single Event Transient Pulsewidths in Digital Microcircuits”, IEEE Transactions on Nuclear Sciences, Vol. 51, No 6, Part 2, IEEE Computer Society, Los Alamitos, CA, December 2004, pp. 3285-3290.

[27] Galke, C., Pflanz, M., Vierhaus, H., “On-line Detection and Compensation of Transient Errors in Processor Pipeline-Structures”, in Proceedings of the International On-line Test Symposium, IOLTS, 2002.

[28] George, J., Koga, R., Swift, G., Allen, G., Carmichael, C., Tseng, C., “Single Event Upsets in Xilinx Virtex-4 FPGA Devices”, IEEE Radiation Effects Data Workshop, 2006, p.109 – 114.

[29] Henes, E., Vieira, M., Ribeiro, I., Wirth, G., Kastensmidt, F. L., “Using Bulk Built-in Current Sensors in Combinational and Sequential Logic to Detect Soft Errors”, IEEE Micro, IEEE Computer Society, v. Set-Ou, p. 10-18, 2006.

[30] Hertwig, A., Hellebrand, S., Wunderlich, H., “Fast Self-Recovering Controllers”, in Proceedings of 16th IEEE VLSI Test Symposium, 1998.

[31] Houghton, A. D. “The Engineer’s Error Coding Handbook”. London: Chapman & Hall, 1997.

[32] Johnston, A., “Scaling and Technology Issues for Soft Error Rates”, in Proceedings of 4th Annual Research Conference on Reliability, Stanford University, October 2000.

[33] Kastensmidt, F. L., Kinzel Filho, C., Carro, L., “Improving Reliability of SRAM Based FPGAs by Inserting Redundant Routing”, IEEE Transactions on Nuclear Science, New York, v. 53, n. 4, 2006. p. 2060-2068.

[34] Kastensmidt, F., Neuberger, G., Carro, L., Reis, R., Hentschke, R., “Designing Fault-Tolerant Techniques for SRAM-based FPGAs”, IEEE Design and Test of Computers (D&T), v.21, n.6, Dec., 2004.

[35] Kastensmidt, F., Sterpone, L., Carro, L., Sonza Reorda, M., “On the Optimal Design of Triple Modular Redundancy Logic for SRAM-based FPGAs”, in the Proceedings of Design Automation and Test in Europe (DATE), IEEE, 2005.

[36] Label, K. et al., “A roadmap for NASA's radiation effects research in emerging microelectronics and photonics”, in Proceedings of IEEE Aerospace Conference, 2000, IEEE Computer Society, 2000. p. 535-545.

[37] LaBel, K., Berg, M., Black, D., Robinson, W., Jordan, A., “Trade Space Involved with Single Event Upset (SEU) and Transient (SET) Handling of Field Programmable Gate Array (FPGA) Based Systems”, 2006 Workshop on Hardened Electronics and Radiation Technology, HEART, 2006.

[38] Lacoe, R., “CMOS Scaling Design Principles and Hardening-by-Design Methodologies,” IEEE NSREC Short Course, 2003.

[39] Leray, J., “Earth and Space Single-Events in Present and Future Electronics”, in European Conference on Radiation and Its Effects on Components and Systems, RADECS, 6., 2001. Short Course. IEEE Computer Society, 2001.

[40] Lima, F., Carmichael, C., Fabula, J., Padovani, R., Reis, R., “A fault injection analysis of Virtex® FPGA TMR design methodology”, in European Conference on Radiation and Its Effects on Components and Systems, RADECS, 2001. pp. 275-282.

[41] Lima, F., Cota, E., Carro, L., Lubaszewski, M., Reis, R., Velazco, R., Rezgui, S., “Designing a radiation hardened 8051-like micro-controller”, in Proceedings of IEEE Symposium on Integrated Circuits and Systems Design, SBCCI, 13., 2000. pp. 255-260.

[42] Lisboa, C. A., Erigson, M. I., Carro, L., “System Level Approaches for Mitigation of Long Duration Transient Faults in Future Technologies”, in Proceedings of the 12th IEEE European Test Symposium, ETS, 2007.

[43] Lisbôa, C. A., Schüler, E., Carro, L., “Going Beyond TMR for Protection Against Multiple Faults”, in Proceedings of the 18th Symposium on Integrated Circuits and Systems Design, SBCCI, 2005, pp. 80-85.

[44] Liu, M.N., Whitaker, S., “Low power SEU immune CMOS memory circuits”, IEEE Transactions on Nuclear Science, New York, v.39, n.6, p. 1679-1684, Dec. 1992.

[45] Messenger, G. C., “Collection of Charge on Junction Nodes from Ion Tracks”, IEEE Transactions on Nuclear Sciences, vol. NS-29, pp. 2024-2031, 1982.

[46] Michels, A., Petroli, L., Lisboa, C. L., Kastensmidt, F. L., Carro, L., “SET Fault Tolerant Combinational Circuits Based on Majority Logic”, in Proceedings of IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, DFT, 2006.

[47] Mikkola, E., Vermeire, B., Barnaby, H. J., Parks, H. G., and Borhani, K., “SET Tolerant CMOS Comparator”, IEEE Transactions on Nuclear Science, vol. 51, no. 6, pp 3609-3614, IEEE Computer Society, New York-London, December, 2004.

[48] Mitra, S., Zhang, M., Waqas, S., Seifert, N., Gill, B., Kim, K., “Combinational Logic Soft Error Correction”, in Proceedings of International Test Conference, ITC, 2006.

[49] Mitra, S., Mccluskey, E., “Which Concurrent Error Detection Scheme To Choose?”, in: IEEE International Test Conference, ITC, 2002.

[50] Mohanram, K., “Soft Error Failure Rate Estimation in Combinational Logic Circuits”, Proceedings of the 6th Latin-American Test Workshop, Salvador, Brazil, 2005. pp.181-186.

[51] Morgan, K., Caffrey, M., Graham, P., Johnson, E., Pratt, B., Wirthlin, M., “SEU-induced persistent error propagation in FPGAs”, IEEE Transactions on Nuclear Science, December 2005.

[52] Neuberger, G., Kastensmidt, F. L., Carro, L., Reis, R., “A Multiple Bit Upset Tolerant SRAM Memory”, Transactions on Design Automation of Electronic Systems, TODAES, v.8, 2003. pp.577-590.

[53] Neuberger, G., Kastensmidt, F. L., Reis, R., “Designing an Automatic Technique for Optimization of Reed-Solomon Codes to Improve Fault-tolerance in Memories”, IEEE Design and Test of Computers, USA, v. 22, n. 1, 2005. p.50-58.

[54] Neves, C., Henes Neto, E. C., Ribeiro, I., Wirth, G., Kastensmidt, F. L., Guntzel, J., “Automatic Evaluation of Single Event Transient Propagation in CMOS Logic Circuits Based on Topological Timing Analysis”, in Proceedings of Latin-American Test Workshop, LATW, 2006. p. 49-54.

[55] Nicolaidis, M., "Time redundancy based soft-error tolerance to rescue nanometer technologies", in Proc. VLSI Test Symposium, IEEE Computer Society, 1999, pp. 86-94.

[56] Nieuwland, A., Jasarevic, S., Jerin, G., “Combinational Logic Soft Error Analysis and Protection”, in Proceedings of IEEE International On-line Testing Symposium, IOLTS, 2006.

[57] Normand, E., “Correlation of in-flight neutron dosimeter and SEU measurements with atmospheric neutron model”, IEEE Transactions on Nuclear Science, New York, v.48, n.6, p. 1996-2003, Dec. 2001.

[58] Normand, E., “Single event upset at ground level”, IEEE Transactions on Nuclear Science, New York, v.43, n.6, p. 2742 -2750, Dec. 1996.

[59] O'bryan, M. et al., “Compendium of Single Event Effects Results for Candidate Spacecraft Electronics for NASA”, in IEEE Nuclear and Space Radiation Effects Conference, NSREC, 2006.

[60] Omana, M., Papasso, G., Rossi, D., Metra, C., “A Model for Transient Fault Propagation in Combinatorial Logic”, in Proceedings of the 9th IEEE International On-Line Testing Symposium, IOLTS, 2003.

[61] Pratt, B., Caffrey, M., Graham, P., Morgan, K., Wirthlin, M., “Improving FPGA Design Robustness with Partial TMR”, 44th Annual IEEE International Reliability Physics Symposium Proceedings, 2006. p. 226 – 232.

[62] Quinn, H., Graham, P., "Terrestrial-Based Radiation Upsets: A Cautionary Tale," IEEE Symposium on Field-Programmable Custom Computing Machines, 2005.

[63] Quinn, H.; Graham, P.; Krone, J.; Caffrey, M.; Rezgui, S., “Radiation-induced multi-bit upsets in SRAM-based FPGAs”, in IEEE Transactions on Nuclear Science, Vol. 52, Issue 6, Dec. 2005. pp. 2455 – 2461.

[64] Rebaudengo, M., Sonza Reorda, M., Violante, M., “Simulation-based Analysis of SEU effects of SRAM-based FPGAs”, in the Proceeding of Field Programmable Logic, FPL, 2002. Los Alamitos : IEEE Computer Society, 2002. p. 607-615.

[65] Reed, R. A., Carts, M. A., Marshall, P. W., Musseau, O., Mcnulty, P. J., Roth, D. R., Buchner, S., Melinger, J., Corbiere, T., “Heavy Ion and Proton Induced Single Event Multiple Upsets”, IEEE Transactions on Nuclear Science, Vol. 44, Issue 6, pp. 2224-2229, December 1997.

[66] Rejimon T., Bhanja, S., “A Timing-Aware Probabilistic Model for Single-Event-Upset Analysis”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, VOL. 14, NO. 10, Oct. 2006, pp.1130-1139.

[67] Reorda, S., Sterpone, L., Violante, M., “Efficient estimation of SEU effects in SRAM-based FPGAs”, 11th IEEE International On-Line Testing Symposium, 2005. p. 54 – 59.

[68] Rockett, L. R., “A design based on proven concepts of an SEU-immune CMOS configurable data cell for reprogrammable FPGAs”, Microelectronics Journal, Elsevier, v.32, p. 99-111, 2000.

[69] Rockett, L. R., “An SEU-hardened CMOS data latch design”, IEEE Transactions on Nuclear Science, New York, v.35, n.6, p. 1682-1687, Dec. 1988.

[70] Rossi, D., Omaña, M., Toma, F. and Metra, C., “Multiple Transient Faults in Logic: An Issue for Next Generation ICs?”, in Proceedings of the 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT 2005), IEEE Computer Society, Los Alamitos, CA, October 2005, pp. 352-360.

[71] Schüler, E., Carro, L., “Reliable Circuits Design Using Analog Components”, in Proceedings of the 11th Annual IEEE International Mixed-Signals Testing Workshop – IMSTW 2005, Volume 1, IEEE Computer Society, Cannes, June 27-29, 2005, pp 166-170.

[72] Semiconductor Industry Association. International Technology Roadmap for Semiconductors – ITRS 2005, last access May 25, 2006. http://www.itrs.net/Common/2005ITRS/Home2005.htm.

[73] Shirvani, P., Saxena, N., Mccluskey, E., “Software Implemented EDAC Protection Against SEUs”, Center for Reliable Computing, May 2001.

[74] Shivakumar, P. et al. “Modelling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic”. In: International Conference on Dependable Systems and Networks. 2002.

[75] Srinivasan, G. R., “Modeling the Cosmic-Ray-Induced Soft-Error Rate in Integrated Circuits: An Overview”. IBM Journal of Research and Development, Vol. 40, No. 1, 1996, pp. 77-90.

[76] Sterpone, L., Reorda, M.S., Violante, M., “RoRA: a reliability-oriented place and route algorithm for SRAM-based FPGAs”, Research in Microelectronics and Electronics, 2005, Volume 1, 2005. p.173 – 176.

[77] Sterpone, L., Violante, M., “A design flow for protecting FPGA-based system against single event upsets”, in the Proceedings of 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI System (DFT’05), October 3-5, 2005, pp. 436-444.

[78] Swift, G. M., Rezgui, S., George, J., Carmichael, C., “Dynamic Testing of Xilinx Virtex-II Field Programmable Gate Array(FPGA) Input/Output Blocks(IOBs)”, IEEE Transactions on Nuclear Science, VOL. 51, NO. 6, 2004.

[79] Velazco, R. et al., “Two CMOS memory cells suitable for the design of SEU-tolerant VLSI circuits”, IEEE Transactions on Nuclear Science, New York, v.41, n.6, p. 2229–2234, Dec. 1994.

[80] "Virtex-II Static Characterization", Xilinx Single Event Effects Consortium, 2004, http://parts.jpl.nasa.gov/docs/swift/virtex2_0104.pdf

[81] Wang, J.J., RTAXS Single Event Effects Test Rep., Aug. 2004 [available on-line at http://www.actel.com/documents/RTAXS_SEE_Report.pdf]

[82] Wang, N., Patel, S., “ReStore: Symptom Based Soft Error Detection in Microprocessors”, in Proceedings of the International Conference on Dependable Systems and Networks, DSN, 2005.

[83] Weaver, H., et al., “An SEU Tolerant Memory Cell Derived from Fundamental Studies of SEU Mechanisms in SRAM”, IEEE Transactions on Nuclear Science, New York, v.34, n.6, Dec. 1987.

[84] Whitaker, S., Canaris, J., Liu, K., “SEU hardened memory cells for a CCSDS Reed-Solomon encoder”, IEEE Transactions on Nuclear Science, New York, v.38, n.6, p. 1471-1477, Dec. 1991.

[85] Wirth, G., Vieira, M., Henes, E., Kastensmidt, F. L. “Modeling the sensitivity of CMOS circuits to radiation induced single event transients”. Microelectronics Reliability. Elsevier, 2007.

[86] Xilinx Inc. Virtex® Series Datasheets and Application Notes, www.xilinx.com, 2006.

[87] Zhang, B., Wang, W., Orshansky, M., “FASER: Fast Analysis of Soft Error Susceptibility for Cell-Based Designs”, Workshop on System Effects of Logic Soft Errors, SELSE, 2005.

[88] Zhang, M., Shanbhag, N., “Soft-Error-Rate-Analysis (SERA) Methodology”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, VOL. 25, NO. 10, Oct. 2006. pp.2140-2155.

[89] Zhou, Q. et al., ''Transistor Sizing for Radiation Hardening'', in Proceedings of IRPS, 2004. pp. 310-315.