Design and Implementation of a Hardened Reconfiguration Controller for Self‐Healing Systems on SRAM‐Based FPGAs

Master Thesis Presented to Politecnico di Milano

By: Naser Derakhshan
Examiner: Axel Jantsch
Supervisor: Cristiana Bolchini

Spring 2013
IN THE NAME OF GOD
Abstract
As digital systems become larger and more complex, their dependability becomes more important, particularly in mission‐critical and safety‐critical applications. Among the various platforms available for implementing a digital system, SRAM‐based Field Programmable Gate Arrays (FPGAs) are increasingly adopted in embedded systems due to their flexibility in meeting multiple requirements such as low cost, high performance, and fast turnaround time compared to fixed Application Specific Integrated Circuits (ASICs). The most attractive feature of SRAM‐based FPGAs is the ability to re‐program¹ the device in a few clock cycles. This feature is further enhanced by Partial Dynamic Reconfiguration (PDR), which allows part of the device to be reconfigured on the fly while the device is operating.
Nevertheless, SRAM‐based FPGAs are more susceptible to faults than other types of FPGAs and ASICs. One of these faults, which mostly occurs at higher altitudes², is a bit flip in the configuration memory caused by ionizing radiation. If such a bit flip alters the critical bits³ in the configuration memory, the function of the design can be corrupted. Thus, appropriate hardening techniques should be used to increase device dependability.
In general, fault‐tolerant techniques are mostly based on spatial redundancy. However, these techniques can be combined with the FPGA’s reconfiguration capability for recovery. Since the complexity of systems is increasing and hardening techniques demand more resources, a single FPGA may not suffice to contain the whole system. In this case, multi‐FPGA platforms must be considered.
In this thesis, a hardened generic reconfiguration controller that manages the occurrence of soft errors in self‐healing systems implemented on SRAM‐based FPGAs is demonstrated and analyzed. The controller is able to correct single event upsets (SEUs) in the configuration memory, in both static and partially reconfigurable regions, by means of the Xilinx PDR capability. Moreover, the controller itself is hardened with fault‐tolerant techniques and is able to detect and mask its own errors. The developed controller is compared with similar approaches based on a microcontroller embedded in the FPGA. Finally, the presented structure is shown to be fully functional on the XUPV5‐LX110T evaluation board.
¹ Re‐configuring
² 40000 feet and above
³ Critical bits are those bits that cause functional failure if they change state
Preface
This report is submitted as a master thesis to fulfill the requirements for the master degree in the System‐on‐Chip program at the ICT School of the Royal Institute of Technology (KTH). This thesis was carried out in spring 2012 at Politecnico di Milano during an exchange study.
I would like to take this opportunity to express my sincere appreciation to Prof. Cristiana Bolchini, my supervisor at Politecnico di Milano, for her constant support, motivation and guidance during this project. Further, I would like to thank Dr. Antonio Miele, Dr. Chiara Sandionigi and Matteo Carminati for their practical advice, and all MicroLAB students for their kind support during this thesis. I would also like to express my sincere gratitude to all the KTH and Politecnico staff whose names I may not remember but who helped me a lot to finish my master thesis.
Last, and most importantly, I wish to thank my parents, Akbar Derakhshan and Tooran Hamedmoghadam, whose dedication, spiritual support and encouragement throughout my whole life are beyond compare. Moreover, I wish to kindly thank my lovely wife, Zeinab Hassani, who interrupted her studies in Iran to accompany me during my studies abroad. I really could not have finished my master studies without her support.
Table of Contents
1 Introduction
2 Background and Related Work
  2.1 Motivation
  2.2 Working scenario
  2.3 Adopted Fault Model
  2.4 Self‐Healing System Architecture
  2.5 SEU Mitigation Schemes
  2.6 Summary
3 Proposed Controller Architecture
    3.1.1 Implemented design in the Master side
    3.1.2 Implemented design in the slave side
  3.2 Summary
4 Design Hardening
  4.1 State Machine Encoding
  4.2 Internal Signal Hardening
  4.3 Interface Hardening
  4.4 Bitstream Memory Protection
5 Test Results
6 Conclusion and Future Works
7 Glossary
8 Works Cited
9 Appendices
  9.1 Appendix A: Bitstream Scrubbing and Readback
  9.2 Appendix B: Redundancy
  9.3 Appendix C: Xilinx Virtex‐5 overview
  9.4 Appendix D: Configuration modes in Virtex 5
    9.4.1 Configuration Modes and Pins in Virtex 5 [31]
    9.4.2 Serial Configuration Interface [31]
LIST OF FIGURES
FIGURE 1 BASIC PREMISE OF PARTIAL RECONFIGURATION
FIGURE 2 FT SYSTEM ON MULTI‐FPGA PLATFORM. DISTRIBUTED SOLUTION (LEFT); CENTRALIZED SOLUTION (RIGHT)
FIGURE 3 A CONFIGURATION CONTROLLER BLOCK‐DIAGRAM BASED ON MICROBLAZE
FIGURE 4 RECONFIGURATION CONTROLLER BLOCK DIAGRAM
FIGURE 5 SLAVE FPGA (LEFT) AND MASTER FPGA (RIGHT)
FIGURE 6 CONFIGURATION CONTROLLER BLOCK DIAGRAM
FIGURE 7 BLOCK DIAGRAM OF THE MASTER SIDE AND THE TOP MODULE SIGNALS
FIGURE 8 PR CONTROLLER INTERFACE
FIGURE 9 MODULES INSIDE THE TOP (MASTER SIDE)
FIGURE 10 FAULT‐CLASSIFIER INTERFACE
FIGURE 11 FAULT CLASSIFIER FINITE STATE MACHINE DIAGRAM
FIGURE 12 PR CONTROLLER INTERFACE
FIGURE 13 PR CONTROLLER FINITE STATE MACHINE DIAGRAM
FIGURE 14 COMPLETE BLOCK DIAGRAM
FIGURE 15 FULL CONFIGURATION CONTROLLER INTERFACE
FIGURE 16 FULL CONFIGURATION CONTROLLER FINITE STATE MACHINE
FIGURE 17 THE IMPLEMENTED DESIGN WITH AN EXTERNAL MEMORY FOR STORING PARTIAL BIT‐STREAM FILES
FIGURE 18 IMPLEMENTED DESIGN ‐ SLAVE SIDE
FIGURE 19 DIFFERENTIAL INPUT BUFFER PRIMITIVE (IBUFDS)
FIGURE 20 THE CONNECTION BETWEEN TWO EVALUATION BOARDS
FIGURE 21 GENERATED PR REGIONS ON THE FPGA FABRIC
FIGURE 22 A SCHEMATIC FPGA STRUCTURE. TAKEN FROM [8]
FIGURE 23 TMR BASIC PRINCIPLE
FIGURE 24 TMR ‐ DEVICE LEVEL
FIGURE 25 XILINX VIRTEX‐5 XC5VLX110T DEVICE. TAKEN FROM [44]
FIGURE 26 XILINX XUPV5‐LX110T EVALUATION PLATFORM. TAKEN FROM [46]
FIGURE 27 VIRTEX‐5 FPGA SERIAL CONFIGURATION INTERFACE. TAKEN FROM [31]
FIGURE 28 SERIAL CONFIGURATION CLOCKING SEQUENCE. TAKEN FROM [31]
FIGURE 29 MASTER SERIAL MODE CONFIGURATION. TAKEN FROM [31]
LIST OF TABLES
TABLE 1 FPGA VS. ASIC DESIGN ADVANTAGES. TAKEN FROM [10]
TABLE 2 TOP MODULE (MASTER SIDE) INTERFACE PINS
TABLE 3 FAULT‐CLASSIFIER INTERFACE PINS
TABLE 4 PR CONTROLLER PIN DESCRIPTION
TABLE 5 FULL CONFIGURATION CONTROLLER. PIN DESCRIPTION
TABLE 6 BIT ORDERING FOR ICAP 8‐BIT MODE
TABLE 7 BIT ORDERING
TABLE 8 DEVICE UTILIZATION SUMMARY FOR CONFIGURATION CONTROLLER (EXCLUDING BITSTREAM MODULE)
TABLE 9 CONFIGURATION TIMES FOR DIFFERENT PARTIAL BITSTREAMS
TABLE 10 RESOURCE UTILIZATION OF ICAP CONTROLLER
TABLE 11 VIRTEX‐5 DEVICE FRAME COUNT, FRAME LENGTH, OVERHEAD, AND BITSTREAM SIZE [31]
TABLE 12 PERFORMANCE OVERVIEW OF MITIGATION SCHEMES. PART OF THE TABLE IS TAKEN FROM [12]
TABLE 13 VIRTEX‐5 (LX110T) DEVICE SPECIFICATION TAKEN FROM [43]
TABLE 14 VIRTEX‐5 CONFIGURATION MODES
TABLE 15 VIRTEX‐5 FPGA SERIAL CONFIGURATION INTERFACE PINS
1 Introduction
As digital systems become larger and more complex, their dependability becomes more important, particularly in mission‐critical and safety‐critical applications. Among the various platforms available for implementing a digital system, SRAM‐based Field Programmable Gate Arrays (FPGAs) are increasingly adopted in embedded systems due to their flexibility in meeting multiple requirements such as low cost, high performance, and fast turnaround time compared to fixed Application Specific Integrated Circuits (ASICs). The most attractive feature of SRAM‐based FPGAs is the ability to re‐program⁴ the device in a few clock cycles, which allows the system implemented on the FPGA to be updated during its lifetime. This is one of the reasons why SRAM‐based FPGAs are considered for mission‐critical applications where direct maintenance is difficult. The feature is further enhanced by Partial Dynamic Reconfiguration (PDR), which allows part of the device to be reconfigured on the fly while the device is operating. Some advantages of using SRAM‐based FPGAs in space applications are discussed in [1], [2].
Nevertheless, SRAM‐based FPGAs are more susceptible to faults than other types of FPGAs and ASICs. One of these faults, which mostly occurs at higher altitudes⁵, is a bit flip in the configuration memory caused by ionizing radiation [3], [4], [5]. Ionizing radiation (such as neutrons, or alpha particles emitted by natural radioactive isotopes present in the device packaging) can induce undesired single event effects (SEEs) in most silicon devices. SEEs that cause temporary, non‐destructive errors in the device are called soft errors. Soft errors in FPGAs often show up as bit flips in user flip‐flops, internal block memory and configuration memory. Bit flips within the configuration memory are especially challenging: if they alter the critical bits (those that cause functional failure if they change state) in the configuration memory, the function of the design can be corrupted. This is clearly unacceptable for mission‐ or safety‐critical applications. Thus, appropriate hardening techniques must be applied before such devices can be deployed.
In general, fault‐tolerant techniques are mostly based on spatial redundancy. However, these techniques can be combined with the FPGA’s reconfiguration capability for recovery. Since the complexity of modern systems is increasing and hardening techniques demand more resources, a single FPGA may not suffice to contain the whole system. In this case, multi‐FPGA platforms must be considered.
⁴ Re‐configuring
⁵ 40000 feet and above
In this thesis, a generic dynamic partial reconfiguration controller for a fault‐tolerant design on multi‐FPGA platforms is proposed. The final goal is a dependable controller that is able to recover all recoverable faults⁶ by exploiting the reconfiguration capability of the FPGAs. The controller is able to correct single event upsets (SEUs) in the configuration memory of the neighbor FPGA by means of the Xilinx PDR⁷ capability. It can correct and classify soft errors in the configuration memory, in both static and partially reconfigurable regions. Moreover, the controller itself is hardened and is able to detect and mask its own errors.
Modern fault‐tolerant architectures using PDR often use a microprocessor embedded in the FPGA, such as PowerPC or MicroBlaze, as the main processing unit of the configuration controller, like the ones presented in [6], [7]. The innovative contribution of this thesis is implementing all necessary units and components of the FT⁸ configuration controller generically on the FPGA fabric. Moreover, this thesis focuses on multi‐FPGA platforms, which are less discussed in the literature. We propose a distributed solution in which each FPGA on the multi‐FPGA platform is responsible for monitoring its neighbor FPGA and recovering it in case of faults. This method, which is discussed in [8], increases the overall reliability compared with a centralized solution. In addition, the proposed solution is able to correct single or multiple faults (assuming the faults are detected) inside the FPGA.
The rest of this thesis is organized as follows: Chapter 2 briefly introduces the preliminary aspects of the problem and the background elements needed to understand the rest of the thesis. It also discusses other SEU mitigation schemes and introduces the self‐healing system architecture on which our controller is based. Chapter 3 describes the proposed controller architecture. Chapter 4 presents the design hardening of the implemented controller. Chapter 5 presents the test results. Finally, Chapter 6 draws some conclusions and gives some possible future research directions.
⁶ Recoverable faults are faults that do not cause permanent damage to the FPGA fabric
⁷ Partial Dynamic Reconfiguration
⁸ Fault Tolerance
2 Background and Related Work
In this thesis, we propose a dependable reconfiguration controller for embedded systems on multi‐FPGA platforms. Our aim is to increase the overall reliability of the system by means of the PDR capability. The chapter is structured as follows: Section 2.1 presents the motivation for the proposed work and introduces the background elements needed to understand the rest of the thesis. Section 2.2 describes the working scenario of this thesis and its characteristics. In Section 2.3, we explain the adopted fault model. Section 2.4 presents the self‐healing system architecture, which we follow in the rest of the thesis. Other mitigation schemes are discussed in Section 2.5. Finally, Section 2.6 summarizes the chapter.
2.1 Motivation
Occasionally, electronic devices show erroneous behavior for no apparent reason. Through experiments and statistical analysis, scientists and engineers discovered that background radiation is responsible. These failures are generally rare and can be ignored in ordinary applications. However, for many applications, such as mission‐critical and safety‐critical ones, it is important to consider the role of radiation in system reliability. Reliability problems due to radiation most commonly fall into the category termed single event effects (SEEs) and show up as a type of soft error called single event upsets (SEUs) [9].
Among the various platforms available for implementing a digital system, SRAM‐based Field Programmable Gate Arrays (FPGAs) are increasingly adopted in embedded systems due to their flexibility in meeting multiple requirements such as low cost, high performance, and fast turnaround time compared to fixed Application Specific Integrated Circuits (ASICs). Table 1 compares FPGAs with ASICs in various aspects.
Table 1 FPGA vs. ASIC Design Advantages. Taken from [10]

FPGA design
Advantage | Benefit
Faster time‐to‐market | No layout, masks or other manufacturing steps are needed
No upfront non‐recurring expenses (NRE) | Costs typically associated with an ASIC design
Simpler design cycle | Due to software that handles much of the routing, placement, and timing
More predictable project cycle | Due to elimination of potential re‐spins, wafer capacities, etc.
Field reprogrammability | A new bitstream can be uploaded remotely

ASIC design
Advantage | Benefit
Full custom capability | For design since device is manufactured to design specs
Lower unit costs | For very high volume designs
Smaller form factor | Since device is manufactured to design specs
FPGA designs offer a faster time to market and lower non‐recurring expenses (NRE). They also have a simpler design cycle than ASICs. However, FPGA designs generally exhibit worse performance than ASICs in terms of logic density, circuit speed, and power consumption. In [11], the authors presented empirical measurements quantifying the gap between 90 nm CMOS FPGAs and 90 nm CMOS standard‐cell ASICs. They observed that for circuits implemented entirely using LUTs and flip‐flops (logic‐only), an FPGA is on average 40 times larger and 3.2 times slower than a standard‐cell implementation. An FPGA also consumes, on average, 12 times more dynamic power than an equivalent ASIC.
“Although FPGAs used to be selected for lower speed, complexity, volume designs in the past, today’s FPGAs easily push the 500 MHz⁹ performance barrier. With unprecedented logic density increases and a host of other features, such as embedded processors, DSP blocks, clocking, and high‐speed serial at ever‐lower price points, FPGAs are a compelling proposition for almost any type of design” [10]. The most attractive feature of SRAM‐based FPGAs is the ability to re‐program¹⁰ the device in a few clock cycles, which allows the system implemented on the FPGA to be updated during its lifetime. This is one of the reasons why SRAM‐based FPGAs are considered for mission‐critical applications where direct maintenance is difficult. The feature is further enhanced by Partial Dynamic Reconfiguration (PDR), which allows part of the device to be reconfigured on the fly while the device is operating.
In this thesis, we focus on SRAM‐based FPGAs in multi‐FPGA platforms. In an SRAM‐based FPGA, the combinational and sequential logic is implemented in programmable configurable logic blocks (CLBs), which are customized by loading configuration data (the bitstream) into the SRAM cells of the configuration memory [12]. Since the functionality of an SRAM‐based FPGA is determined by its configuration memory, any bit flip that alters the critical bits¹¹ in the configuration memory corrupts the function of the design. Thus, to obtain a dependable system, especially in a harsh environment, the system on the chip should be hardened using suitable fault‐tolerant (FT) techniques.
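To make the notion of a critical bit concrete, the following Python sketch models a hypothetical 4‐input LUT whose truth table is held in 16 configuration bits. This is a simplification (real Virtex‐5 LUTs have six inputs and their bits are spread over configuration frames). A single flipped configuration bit changes the implemented function only for the input pattern that addresses that bit, so whether the bit is critical depends on whether the design ever applies that pattern.

```python
# Toy model: a hypothetical 4-input LUT whose truth table is stored in
# 16 configuration bits held in one Python int (real Virtex-5 LUTs have
# 6 inputs and their bits live in configuration frames).


def lut_output(config: int, a: int, b: int, c: int, d: int) -> int:
    """Return the configuration bit addressed by the four LUT inputs."""
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (config >> index) & 1


def flip_bit(config: int, bit: int) -> int:
    """Model a single event upset in configuration cell `bit`."""
    return config ^ (1 << bit)


# Golden configuration: a 4-input AND, so only inputs 1111 (index 15) give 1.
AND4 = 1 << 15
# An SEU hits the cell addressed by inputs 0000.
upset = flip_bit(AND4, 0)

for label, config in (("golden", AND4), ("upset", upset)):
    print(label, lut_output(config, 0, 0, 0, 0), lut_output(config, 1, 1, 1, 1))
# golden 0 1
# upset 1 1
```

After the upset, the "AND gate" outputs 1 when all inputs are 0; rewriting the affected configuration bit with its golden value restores the function, which is the idea behind the reconfiguration‐based recovery discussed in the rest of this chapter.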
2.2 Working scenario
The working scenario of this thesis is space applications, where SEUs are caused by secondary particles. According to [9], SEUs are caused by “secondary particles liberated by the collision of a neutron with a silicon atom or from a contaminant emitting an alpha particle in an electronic device. The neutrons are generated when cosmic rays and protons from space interact with the atmosphere. The cosmic rays are from both inside (the sun) and outside (novas and supernovas) of the solar system. The neutrons range in energy from below 1 million electron volts (MeV) to more than 1,000 MeV.”
Although it is possible to protect electronic equipment against these high‐energy neutrons by means of shielding, this is not practical for most applications because the amount of material required for such a shield is prohibitive (e.g., as much as 30 meters of water for high‐energy neutrons) [9]. In addition to neutron effects, an SEU can be caused by alpha particles emitted by natural radioactive isotopes present in the device material and packaging [9].
⁹ Xilinx Zynq‐7000 technology has already passed 800 MHz
¹⁰ Re‐configuring
¹¹ Critical bits are those bits that cause functional failure if they change state
2.3 Adopted Fault Model
We can organize the effects of ionizing radiation into three main categories: transient current pulses, changes in memory values (such as bit flips, or SEUs), and latch‐up. The first two categories result in recoverable (or soft) faults, while latch‐up, which can cause severe overheating, melting, or vaporization, can damage the FPGA fabric and results in non‐recoverable (or hard) faults. Because maintenance is difficult in mission‐critical applications, aging effects must be added to these categories; aging can also lead to non‐recoverable faults. Since the primary concern for FPGAs is soft faults, we expand on the first two categories:
1‐ Transient current pulses may change the values of internal signals or strike the clock line. Their effect may be transient and vanish after a short time, or they may propagate to flip‐flop inputs and get registered. In both cases, they can cause an erroneous value that leads to an incorrect result at the output. A suitable error detection and masking technique is necessary to avoid propagating an incorrect result to other modules; such approaches are discussed in [13], [14]. The fault can then be recovered by performing a reset.
2‐ The second type of recoverable fault is a change in memory values. SRAM‐based FPGAs have two types of memory: the user registers and block RAMs, which store user data, and the static configuration memory, which stores the configuration bitstream. Any change in the configuration memory modifies the functionality of the system implemented inside the FPGA. The only way to recover the configuration memory is to rewrite its corrupted portion with the correct portion of the bitstream. In this work, we concentrate on hardening the design implemented inside the FPGA against upsets in the configuration memory. The proposed controller is able to correct single‐bit or multiple‐bit upsets (MBUs) in the configuration memory by partially reconfiguring the corrupted portion of the memory or, in the worst case, reconfiguring the whole FPGA.
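The two recovery paths in the list above can be sketched in Python. The sketch assumes a triple modular redundancy (TMR) majority voter as the detection and masking technique (one common choice; redundancy is introduced in Appendix B) and models configuration frames as plain integers compared against a golden copy. `plan_recovery` and its `max_partial_frames` threshold are hypothetical illustrations, not the controller implemented in this thesis.

```python
from __future__ import annotations

from enum import Enum


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 vote over three redundant copies of a word (TMR)."""
    return (a & b) | (a & c) | (b & c)


def mismatch(a: int, b: int, c: int) -> bool:
    """True if any replica disagrees: the fault is detected (and masked by the vote)."""
    return not (a == b == c)


class Recovery(Enum):
    NONE = "no action"
    RESET = "reset"                      # transient fault registered in flip-flops
    PARTIAL = "partial reconfiguration"  # rewrite only the corrupted frames
    FULL = "full reconfiguration"        # worst case: reload the whole FPGA


def plan_recovery(tmr_mismatch: bool, readback: list[int], golden: list[int],
                  max_partial_frames: int = 8) -> tuple[Recovery, list[int]]:
    """Pick a recovery action from a TMR alarm and a configuration readback.

    `readback` and `golden` are toy stand-ins for configuration frames (one int
    per frame); `max_partial_frames` is a hypothetical threshold beyond which
    reloading the whole device is assumed to be preferable.
    """
    bad = [i for i, (r, g) in enumerate(zip(readback, golden)) if r != g]
    if bad:
        # Configuration upset (single or multiple bits): rewrite corrupted frames.
        if len(bad) <= max_partial_frames:
            return Recovery.PARTIAL, bad
        return Recovery.FULL, []
    if tmr_mismatch:
        # Configuration is intact, so the mismatch came from a transient fault.
        return Recovery.RESET, []
    return Recovery.NONE, []


# Usage: the third replica is hit by an upset; the vote masks it.
voted = majority(0b1010, 0b1010, 0b1110)
print(bin(voted), mismatch(0b1010, 0b1010, 0b1110))  # 0b1010 True
print(plan_recovery(True, [1, 9, 3], [1, 2, 3]))     # partial: frame 1 only
```

Checking the configuration before resetting reflects the fault model above: a TMR alarm may be the symptom of a configuration upset, and a reset alone would not repair the corrupted bits.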
2.4 Self‐Healing System Architecture
We applied a hybrid fault‐tolerant technique to our multi‐FPGA architecture. In this architecture each
FPGA hosted a portion of the design. This portion on each FPGA is hardened with hardware redundancy
techniques and distributed among available partially reconfigurable Regions (PRR‐1 to PRR‐n).
Partitioning the system into different portion and then into n PR regions is not mentioned here since the
proposed architecture is not depended on it. The hardware redundancy techniques implemented in this
scenario
controller
Partial Re
file [15]. A
of on‐site
design. Pa
operating
configure
without c
being reco
In this sce
reconfigu
modified
contents
loading o
should no
The partia
be update
download
stored in
If these p
fault toler
other logi
Partial R
comprehe
design aft
reconfigu
12 A brief in13 Protecte
are able to
r of the neigh
configuration
According to X
e programmi
artial Reconfi
g FPGA design
s the FPGA, p
ompromising
onfigured.” [1
enario, the FP
rable (PR) re
by means o
of the partia
f a partial bi
ot) be reconfig
al BIT files (PR
ed later durin
ding one of se
an external m
partial bit files
rance by reco
cs remains fu
econfiguratio
ensive solutio
ter reconfigu
ration of a b
ntroduction toed against radia
detect, loca
bor FPGA for
n is the modif
Xilinx Partial
ng and re‐p
guration (PR)
n by loading a
partial BIT file
g the integrity
15] The basic
Figu
PGAs are stru
egions. The p
f partial reco
l bit file. The
t file. The st
gured.
R_Bit_x.bit) s
ng the design
everal availab
memory.
s are stored i
onfiguring the
unctioning an
on can be
on for PR des
ration. There
block has not
hardware redation
ate and mas
r recovery12.
fication of an
Reconfigurat
rogramming
) takes this fle
a partial conf
es can be do
… faults and report them to the reconfiguration controller.
Partial reconfiguration allows the modification of an operating FPGA design by loading a partial configuration file. According to the Xilinx Partial Reconfiguration User Guide, “FPGA technology provides the flexibility of on‐site programming and re‐programming without going through re‐fabrication with a modified design. Partial Reconfiguration (PR) takes this flexibility one step further, allowing the modification of an operating FPGA design by loading a partial configuration file, usually a partial BIT file. After a full BIT file configures the FPGA, partial BIT files can be downloaded to modify reconfigurable regions in the FPGA without compromising the integrity of the applications running on those parts of the device that are not being reconfigured.” The basic block diagram of partial reconfiguration is illustrated in Figure 1.
Figure 1 Basic Premise of Partial Reconfiguration
The design is structured into two separate regions: a static region and several partial reconfigurable (PR) regions. The portion of the system that is implemented in the PR regions can be modified by the reconfiguration controller: the reconfigurable logic is replaced, while the static logic remains functioning and is completely unaffected by the reconfiguration. The static region contains the other parts of the design, which cannot (or should not) be reconfigured. The partial bitstreams should be calculated offline, prior to the FPGA design; however, they may be updated during the design lifetime. As shown in Figure 1, each PR module can be modified by loading one of the available partial bit files, PR_Bit_A.bit to PR_Bit_D.bit. If these bit files are stored in a protected memory13, partial reconfiguration can improve FPGA reliability by replacing the faulty portion of the FPGA with a correct partial bitstream, while the rest of the device keeps operating and is completely unaffected.
13 More information regarding redundancy is available in Appendix B.
Partial reconfiguration can be done via JTAG, SelectMAP, Master‐Serial, or ICAP. ICAP is a comprehensive solution for PR design with regard to its capability of doing readback and verifying the design after reconfiguration. There are some status registers in ICAP which indicate an error if partial reconfiguration of a block has not succeeded. Furthermore, it is possible to implement a CRC checker in the PR controller to check the CRC of the received file before forwarding it to the ICAP. By using these two techniques (monitoring the ICAP status registers and CRC checking), we can be sure that the target FPGA is partially reconfigured correctly.
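The CRC gate mentioned above can be sketched in a few lines. This is a minimal model, assuming the stored partial bit file is accompanied by a CRC‐32 computed with zlib; that checksum is chosen only for illustration (the Virtex‐5 bitstream carries its own, differently computed CRC), and the `forward` callable is a hypothetical stand‐in for the ICAP write path.

```python
import zlib
from typing import Callable


def forward_if_valid(bitstream: bytes, expected_crc: int,
                     forward: Callable[[bytes], None]) -> bool:
    """Forward a partial bitstream to the (hypothetical) ICAP path only if
    its CRC-32 matches the value stored alongside it."""
    if zlib.crc32(bitstream) & 0xFFFFFFFF != expected_crc:
        # Corrupted copy: refuse it so the configuration memory is not touched.
        return False
    forward(bitstream)
    return True


if __name__ == "__main__":
    sent = []
    data = b"example partial bitstream"
    crc = zlib.crc32(data)
    print(forward_if_valid(data, crc, sent.append))      # True
    print(forward_if_valid(data, crc ^ 1, sent.append))  # False
    print(len(sent))                                     # 1
```

Rejecting the file before it reaches ICAP means a corrupted copy in memory never overwrites a healthy region; combined with the ICAP status registers, this covers both the "bad input" and "write failed" cases.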
Using the PR approach has some advantages and disadvantages:
Advantages:
• Partial BIT files are calculated offline, at design time, and stored in advance. Therefore, the controller needed for partial reconfiguration can be smaller than in other methods.
• BIT files can be updated later during the design lifetime.
• The PR flow is straightforward and can be carried out from beginning to end in the Xilinx PlanAhead™ software.
• The function of each partial reconfigurable region can be changed completely by using a different BIT file (the ability to time‐multiplex hardware dynamically).
• Several interfaces exist for performing partial reconfiguration from outside the device.
• There is no need to know the memory addresses of the PR modules.
Disadvantages:
• Extra memory is needed to store both the full configuration and the partial reconfiguration BIT files.
• Not all implementation options are available in the PR flow (e.g., techniques that perform optimization across the entire design) [15].
• PR design affects performance. In general, one should expect about a 10% degradation in clock frequency and should not expect to exceed 80% slice utilization in packing density [15].
• Routing challenges may occur if the reconfigurable region is too small or has a non‐rectangular shape [15].
We considered a distributed solution for this multi‐FPGA design, in which each FPGA is responsible for monitoring its neighbor FPGA and, in case of a fault, recovering the neighbor FPGA to a correct state14. Another approach could be a centralized solution in which a rad‐hard FPGA monitors all other FPGAs in the design. The main advantage of the distributed solution over the centralized one is that the controller does not need to reside in a separate device; it can be implemented alongside the main system on the same FPGAs [8]. Moreover, the distributed solution is independent of the number of FPGAs, whereas in the centralized solution the number of FPGAs must be defined prior to the design. In both scenarios, the original configuration bitstreams should be protected against SEUs. We will discuss this issue in section 4.4. Figure 2 illustrates the basic principle of the distributed and centralized solutions.
14 By means of a reconfiguration controller
Figure 2 FT system on Multi‐FPGA platform. Distributed solution (left); Centralized solution (right)
Like any other digital design, the above‐mentioned architecture involves a tradeoff between software‐based and hardware‐based implementation. In a software‐based implementation, a soft or hard processor (such as MicroBlaze, PowerPC or ARM) is embedded into the design. The processor, as shown in Figure 3, then manages the reconfiguration process by setting the required registers for the read/write operations of the Xilinx XPS HWICAP core [16]. Although a software‐based solution gives the user more flexibility, the processor itself is a single point of failure and should be hardened. The usual method for hardening a soft processor involves triplication, which can be very costly in terms of resource utilization. In our proposed hardware‐based solution, the controller is implemented purely in hardware, without any processor. Implementing it this way lets the designer apply any available FT technique to the controller. In addition, a hardware implementation can be optimized for area and speed.
Figure 3 A configuration controller block diagram based on MicroBlaze
2.5 SEU Mitigation Schemes
Any time the FPGA is powered up, all its configuration contents are refreshed. Therefore, the simplest way to restore the FPGA to a correct condition is to power‐cycle it. However, this method is not applicable to many applications, because it causes the FPGA to stop functioning for several seconds, and this downtime is not tolerable for them. In these applications, other mitigation techniques should be deployed. Moreover, power cycling loses the state of the FPGA, so a synchronization technique is needed to resynchronize the FPGA with the other processing elements in the design.
Another mitigation scheme is “bitstream scrubbing and readback” (or simply scrubbing), which means reading back the configuration bitstream stored in the configuration memory, comparing it with the original one and correcting any affected configuration bits. The process is performed continuously, independently of the occurrence of a soft error. Such an approach is discussed in [17], [18]. Since this approach is blind, it introduces latency in detecting a fault, and it may cause much more overhead than the other approaches because of the continuous readback and checking15. Some work has been carried out recently to make scrubbing faster and on demand. In [19], the author proposed a constraint‐driven re‐placement method to reduce the number of sensitive configuration frames and, consequently, the scrubbing time.
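The blind scrubbing loop described above can be modelled as a pass over configuration frames. This is a minimal sketch, assuming frames can be read and written individually; `read_frame` and `write_frame` are hypothetical stand‐ins for real configuration‐port accesses, and the golden copy plays the role of the original bitstream.

```python
from typing import Callable, List

Frame = bytes


def scrub_pass(golden: List[Frame],
               read_frame: Callable[[int], Frame],
               write_frame: Callable[[int, Frame], None]) -> List[int]:
    """One full scrubbing pass; returns the indices of the frames rewritten."""
    repaired = []
    for index, expected in enumerate(golden):
        actual = read_frame(index)        # readback of one configuration frame
        if actual != expected:            # compare with the original bitstream
            write_frame(index, expected)  # correct by rewriting the whole frame
            repaired.append(index)
    return repaired


if __name__ == "__main__":
    golden = [bytes([i] * 4) for i in range(8)]
    config_memory = list(golden)
    config_memory[5] = bytes([5, 5, 5 ^ 0x10, 5])  # inject a single bit flip

    def read_frame(i: int) -> Frame:
        return config_memory[i]

    def write_frame(i: int, data: Frame) -> None:
        config_memory[i] = data

    print(scrub_pass(golden, read_frame, write_frame))  # [5]
    print(config_memory == golden)                      # True
```

Note that every frame is read on every pass whether or not an upset occurred; this is exactly the source of the latency and overhead that motivates the on‐demand alternatives.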
A faster, on‐demand solution is to modify the operating FPGA design by loading a partial configuration bitstream. Partial reconfiguration is only a recovery technique, which means that soft errors must be detected (and located) before they can be repaired. Detection and masking can be performed by well‐known hardware redundancy techniques, either triple modular redundancy (TMR) [20], [21], [22], [23] or duplication with comparison (DWC) combined with concurrent error detection (CED) [24].
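To make the difference between the two redundancy schemes concrete, the following sketch models them at the bit level. It is an illustration of the general techniques, not of the specific designs in [20] to [24]: TMR masks a fault with a bitwise majority vote and can also point at the disagreeing replica, whereas DWC only reports that the two copies differ, so a separate CED mechanism (not modelled here) is needed to tell which copy is wrong.

```python
from typing import Optional, Tuple


def tmr_vote(a: int, b: int, c: int) -> Tuple[int, Optional[int]]:
    """Bitwise majority of three replicas, plus the index (0..2) of a
    replica that disagrees with the voted value, or None if all agree."""
    voted = (a & b) | (a & c) | (b & c)
    for index, copy in enumerate((a, b, c)):
        if copy != voted:
            return voted, index  # fault masked and located
    return voted, None


def dwc_mismatch(a: int, b: int) -> bool:
    """Duplication with comparison: True when the two copies disagree."""
    return a != b


if __name__ == "__main__":
    print(tmr_vote(0b1010, 0b1010, 0b1110))  # (10, 2)
    print(dwc_mismatch(3, 1))                # True
```

The located replica index is what a recovery scheme based on partial reconfiguration needs: it tells the controller which region to reload.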
A first implementation of this kind of reconfiguration controller was presented in [25]. The author of that paper proposes a distributed mesh topology in which each FPGA monitors its neighbor FPGA in a multi‐FPGA platform and triggers the reconfiguration of the faulty portion of the neighbor FPGA. However, the solution proposed there for hardening the reconfiguration controller is based on blind readback and checking, which may introduce delay in the recovery. Another work is presented in [16], where the author compares different software‐based solutions for the reconfiguration controller to achieve the minimum reconfiguration time. However, since the reconfiguration controller is implemented in the embedded processor, hardening the controller is very difficult. The most recent study in this area is presented in [26], where the author implemented a hardware‐based ICAP controller for partial reconfiguration. We will compare these approaches with our proposed controller in terms of speed and resource utilization in the upcoming discussion.
2.6 Summary
In this chapter, we presented the necessary requirements for a multi‐FPGA system in a mission‐critical application. We discussed the importance of SRAM‐based FPGAs and introduced their limitations in different environments. We also included a brief comparison of performance, power consumption, cost, and flexibility between SRAM‐based FPGAs and similar embedded processing units. Then, we showed that although SRAM‐based FPGAs are attractive not only in commercial markets but also in mission‐critical and safety‐critical applications, special hardening techniques must be used in a harsh environment. Moreover, we described our working scenario and its characteristics. Next, we presented the main types of faults that threaten electronic devices in this environment, and we discussed radiation and its effects on electronic devices in general and on SRAM‐based FPGAs in particular.
Furthermore, we introduced our self‐healing system architecture, on which the design of our controller is based. We also discussed other possible approaches for increasing the reliability of SRAM‐based FPGA designs and performed a brief analysis of the literature on similar approaches.
In the next chapter, we will discuss our proposed solution for this scenario.
15 For more information regarding scrubbing and the Xilinx SEM controller, please refer to Appendix A.
3 Proposed Controller Architecture
The main problems in a fault‐tolerant system are, first, to detect an error during system operation; then, to locate the error as fast as possible; next, to recover the system to a normal condition; and last, to bring the system back to the correct state. Error detection and localization can be done by means of online checkers like the one presented in [27], where the author presents an on‐line testing technique for TMR. Another approach is to combine 2‐rail logic and self‐checking into a concurrent error detection technique like the one presented in [24].
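A two‐rail checker is the basic building block of the 2‐rail approach mentioned above. In the sketch below, each monitored signal is assumed to travel as a pair (x0, x1) that must be complementary: 01 and 10 are valid codewords, while 00 and 11 signal an error. The cell is the classic two‐input two‐rail checker, whose output pair is complementary only if both input pairs are; chaining cells compresses any number of pairs into a single pair.

```python
from functools import reduce
from typing import Iterable, Tuple

Pair = Tuple[int, int]


def trc_cell(a: Pair, b: Pair) -> Pair:
    """Two-input two-rail checker cell: complementary output iff both
    input pairs are complementary."""
    a0, a1 = a
    b0, b1 = b
    z0 = (a0 & b0) | (a1 & b1)
    z1 = (a0 & b1) | (a1 & b0)
    return z0, z1


def is_valid(pair: Pair) -> bool:
    """01 and 10 are codewords; 00 and 11 indicate an error."""
    return pair[0] != pair[1]


def check_all(pairs: Iterable[Pair]) -> bool:
    """Reduce at least one pair through a chain of checker cells."""
    return is_valid(reduce(trc_cell, pairs))


if __name__ == "__main__":
    print(check_all([(0, 1), (1, 0), (0, 1)]))  # True
    print(check_all([(0, 1), (1, 1), (0, 1)]))  # False
```

Once an invalid pair (00 or 11) enters the chain it propagates unchanged to the output, so a single checker output is enough to flag an error anywhere in the monitored group.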
In this thesis, we focus only on fault recovery by means of the PDR capability. Our proposed solution is based on the design methodology presented in [8]. As shown in Figure 2, each FPGA (FPGAi) in our architecture hosts a reconfiguration controller. The main responsibilities of these controllers are as follows:
1‐ The controller has to monitor the error signals of the PR regions, the static region, and the reconfiguration controller of the next FPGA (FPGAi+1) in the proposed mesh topology.
2‐ In case of any error in FPGAi+1, the controller should perform the appropriate action to recover FPGAi+1 to a correct condition by means of reconfiguration.
3‐ The controller itself should be hardened so that, if a fault occurs in the controller, it can detect, locate and mask the fault and inform the reconfiguration controller in FPGAi‐1 to perform the recovery.
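The neighbor relation behind these responsibilities can be written down explicitly. This is a small sketch, assuming the chain of FPGAs is closed into a ring (index arithmetic modulo n) so that every FPGA has both a successor it watches and a predecessor that recovers it; the thesis itself only specifies the FPGAi+1 / FPGAi‐1 relation, and the function names are hypothetical.

```python
def monitors(i: int, n: int) -> int:
    """Index of the FPGA that FPGA i watches and recovers (FPGAi+1)."""
    return (i + 1) % n


def monitored_by(i: int, n: int) -> int:
    """Index of the FPGA whose controller watches FPGA i (FPGAi-1)."""
    return (i - 1) % n


def recovery_plan(faulty: int, n: int) -> dict:
    """Who reconfigures FPGA `faulty`, whether the error is in its PR
    regions, its static region or its own reconfiguration controller."""
    return {"faulty": faulty, "recovered_by": monitored_by(faulty, n)}


if __name__ == "__main__":
    for i in range(4):
        print(i, "->", monitors(i, 4))
    print(recovery_plan(0, 4))  # {'faulty': 0, 'recovered_by': 3}
```

Because the relation depends only on the index and on n, adding an FPGA changes no existing controller logic, which is the property that makes the distributed solution independent of the number of FPGAs. With n = 2 the ring degenerates into two FPGAs watching each other; the implemented prototype uses only the master‐to‐slave direction.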
By considering these responsibilities, the controller can be organized into four main parts: Fault
Classifier, Partial Reconfiguration (PR) Engine, Full Reconfiguration Engine, and Bitstream Module. The
main block diagram of the controller is illustrated in Figure 4.
Figure 4 Reconfiguration Controller block diagram
The Fault Classifier has to monitor the error signals, which are encoded with the two‐rail coding (TRC) technique [28], [29]. If an error is detected in a PR region, the Fault Classifier initiates the PR Engine with the address of the relevant partial bitstream. Then, it monitors the error signals again to see whether the error has been corrected. If the error is detected inside the static region or the PR controller of FPGAi+1, the Fault Classifier initiates the Full Reconfiguration Engine. The Full Reconfiguration Engine may also be initiated if the PR Engine could not fix an error in a PR region after a specific number of tries.
The original bitstreams in our design are stored in a rad‐hard external memory. The Bitstream Module is responsible for providing the necessary protocol for communicating with this memory at the maximum possible speed. To achieve the maximum reconfiguration speed, partial reconfiguration is done via the Internal Configuration Access Port (ICAP) at 3.2 Gbps. However, the ICAP cannot be used for full reconfiguration; for this reason, full reconfiguration is done via the master‐serial configuration mode at 10 Mbps.
The controller in this thesis is implemented and tested on two FPGA platforms (Figure 5); however, it can be extended to any number of FPGAs, since the implemented solution is independent of the number of FPGAs. In the implemented solution, the configuration controller resides in one FPGA (which we call the master), and the system that should be hardened by means of PDR resides in another FPGA (which we call the slave).
In this section, we describe the implemented controller (Figure 6) in detail. We start by explaining the components on the master side and then the components on the slave side.
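The two rates quoted above translate into very different recovery times. The sketch below is plain data‐transfer arithmetic, assuming 3.2 Gbps corresponds to a 32‐bit ICAP at 100 MHz and master‐serial mode moves one bit per clock at 10 Mbps; it ignores setup and protocol overhead, and the bitstream sizes in the example are hypothetical placeholders, not measurements from this design.

```python
ICAP_RATE_BPS = 32 * 100e6      # 32-bit ICAP at 100 MHz = 3.2 Gbps
MASTER_SERIAL_RATE_BPS = 10e6   # one bit per CCLK at 10 MHz = 10 Mbps


def transfer_time_s(size_bytes: int, rate_bps: float) -> float:
    """Pure data-transfer time in seconds, ignoring overhead."""
    return size_bytes * 8 / rate_bps


if __name__ == "__main__":
    partial_bytes = 100_000      # hypothetical partial bit file
    full_bytes = 2_000_000       # hypothetical full bit file
    t_pr = transfer_time_s(partial_bytes, ICAP_RATE_BPS)
    t_full = transfer_time_s(full_bytes, MASTER_SERIAL_RATE_BPS)
    print(f"partial via ICAP: {t_pr * 1e6:.0f} us")           # partial via ICAP: 250 us
    print(f"full via master-serial: {t_full:.1f} s")          # full via master-serial: 1.6 s
```

Even with these rough numbers, repairing a PR region is several orders of magnitude faster than a full reconfiguration, which is why the Fault Classifier falls back to full configuration only when PDR cannot help.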
Figure 5 Slave FPGA (left) and master FPGA (right) [two Virtex‐5 evaluation boards; slave side: FPGA‐2 (target) with reconfigurable modules RM1 to RM3, static parts and ICAP, and CPLD‐2 routing the full configuration interface to the dedicated FPGA‐2 configuration pins; master side: FPGA‐1 (master) and CPLD‐1 routing the Platform Flash to FPGA‐1]
Figure 6 Configuration Controller Block Diagram [master side: fault classifier, PR controller, full configuration controller, BRAM holding Bitstream‐1 to Bitstream‐3, and Platform Flash; partial reconfiguration interface and error signals to the slave; full configuration of FPGA‐2 via master serial]
3.1.1 Implemented design in the Master side
The configuration controller has to be able to reconfigure the neighbor FPGA (slave) fully or partially. Moreover, it should decide whether to perform full configuration or partial reconfiguration based on a set of predefined rules. The block diagram of the master side is shown in Figure 7.
The fault‐classifier module receives the error16 signals from the slave FPGA and sends a request to the PR Controller. Then, the PR Controller initializes the ICAP interface and sends the selected bitstream to the slave FPGA. If the error is in the static region, or the number of errors in the PR regions exceeds a specific threshold (three in our case), the fault classifier classifies these errors as “non‐recoverable by PDR” and sends a request to the full configuration controller to download the full bitstream to the slave FPGA. The partial bitstream files are stored in on‐chip memory17, and the full bitstream is stored in the Platform Flash. We will explain later why the Platform Flash is routed to the FPGA via an onboard CPLD.
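The classification rule in the previous paragraph can be summarized as a small decision function. This is a software sketch of the rule only, under stated assumptions: the error counter is assumed to count PR errors since the last full configuration (matching the number_of_err LEDs) and to restart after a full configuration; the class and method names are hypothetical, and the real classifier is a hardware FSM.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

MAX_PR_ERRORS = 3  # "three in our case"


class Action(Enum):
    NONE = auto()
    PARTIAL = auto()
    FULL = auto()


@dataclass
class FaultClassifier:
    pr_errors: int = 0  # PR errors since the most recent full configuration

    def classify(self, static_error: bool,
                 faulty_pr_region: Optional[int]) -> Tuple[Action, Optional[int]]:
        """Return the action and, for PARTIAL, the index of the bitstream to send."""
        if static_error:
            return self._full()          # static region: not recoverable by PDR
        if faulty_pr_region is None:
            return Action.NONE, None     # all two-rail pairs valid
        self.pr_errors += 1
        if self.pr_errors > MAX_PR_ERRORS:
            return self._full()          # too many PR errors: escalate
        return Action.PARTIAL, faulty_pr_region

    def _full(self) -> Tuple[Action, Optional[int]]:
        self.pr_errors = 0               # assumed reset after full configuration
        return Action.FULL, None
```

For example, three successive PR errors yield three PARTIAL requests, and the fourth yields FULL; a static‐region error yields FULL immediately.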
Figure 7 Block Diagram of the Master Side and the Top module signals
16 These errors can be in the PR regions or the static region. The detection of these errors is the responsibility of the user and can be done by means of FT techniques. In this thesis, we assume that there is an error detection mechanism (such as 2‐rail logic combined with self‐checking) on the slave side.
17 These bitstreams will be moved to an off‐chip memory later.
The interface of the PR controller is shown in Figure 8. Table 2 describes the top module interface.
Figure 8 PR controller interface
Table 2 Top module (Master side) interface pins
Pin Name Type Description
err_1(1:0) Input Two‐rail error signal from PR module one
err_2(1:0) Input Two‐rail error signal from PR module two
err_3(1:0) Input Two‐rail error signal from PR module three
CLK Input Main 100 MHz clock
RST Input Main reset
RAM_DIN Input Serial configuration data input, synchronous to the rising RAM_CCLK edge. This pin is connected to D0 of the Platform Flash.
SLAVE_CCLK Input Configuration clock source for all configuration modes except JTAG. This signal is connected to the CCLK of the slave FPGA.
SLAVE_DONE Input Active‐High signal indicating that full configuration is complete: 0 = slave FPGA not configured, 1 = slave FPGA configured.
SLAVE_INIT_B Input Before the Mode pins are sampled, INIT_B is an input that can be held Low to delay the full configuration of the slave FPGA. After the Mode pins are sampled, INIT_B is an open‐drain, active‐Low output indicating whether a CRC error occurred during full or partial reconfiguration: 0 = CRC error, 1 = no CRC error.
ICAP_INPUT_N(15:0) Output (differential) ICAP write data bus. The bus width depends on the ICAP_WIDTH parameter. The bit ordering is identical to the SelectMAP interface.
ICAP_INPUT_P(15:0) Output (differential) ICAP write data bus. The bus width depends on the ICAP_WIDTH parameter. The bit ordering is identical to the SelectMAP interface.
number_of_err(2:0) Output These signals are connected to the LEDs to indicate the number of errors that have happened since the most recent full configuration.
ICAP_CE Output Active‐Low ICAP interface select. Equivalent to CS_B in the SelectMAP interface.
ICAP_CLK Output ICAP interface clock. The data are sampled on the rising edge of this clock.
ICAP_WRITE Output ICAP data flow direction: 0 = WRITE, 1 = READ. Equivalent to the RDWR_B signal in the SelectMAP interface.
RAM_CCLK Output Synchronous clock for the Platform Flash. Data is put on RAM_DIN on the rising edge of this clock.
RAM_CE_B Output Chip Enable output. When CE is High, the Platform Flash is put into low‐power standby mode, the address counter is reset, and the DATA pins are put in a high‐impedance state.
RAM_INIT_B Output Corresponds to OE/RESET_B of the Platform Flash. When Low, this pin holds the address counter in reset and the DATA output is in a high‐impedance state.
Slave_D0 Output Configuration DATA input pin for the slave FPGA.
SLAVE_PROG_B Output Active‐Low asynchronous full‐chip reset. This pin is connected to PROGRAM_B of the slave FPGA.
As can be seen in Figure 9, there are five main components inside TOP. In the following, we discuss each component in detail.
Figure 9 Modules inside the TOP (Master Side)
3.1.1.1 Fault‐Classifier Module
The purpose of the Fault‐Classifier is to analyze the input error signals and decide what kind of configuration is needed to restore the faulty module on the slave side and bring it back to its initial state.
In this design, the master FPGA detects which part of the target FPGA is faulty by means of error signals. These error signals indicate whether the fault is in the reconfigurable modules or in the static region of the target FPGA. Then, the fault classifier on the master side decides whether to perform partial reconfiguration or full configuration. The fault classifier could also reside on the slave side, eliminating the need for error signals. In that case, the fault classifier would only signal to the master side which bitstream needs to be sent to the target FPGA. However, the fault classifier would then become a point of failure itself, and it would have to be monitored by the master periodically.
Figure 10 illustrates the fault‐classifier interface, and Table 3 describes the fault‐classifier pins.
Figure 10 Fault‐Classifier interface
Table 3 Fault‐Classifier interface pins
Pin Name Type Description
CLK Input Main clock for the fault classifier. It could be up to 100 MHz; we connect it to the 6.25 MHz DCM clock.
RST Input Main reset. This pin is connected to the main system reset push button.
START_WRITE Input This signal is connected to a debounced push button and is used to make a pause at the beginning of the state machine. It will be removed in the final design.
SLAVE_DONE Input Active‐High signal indicating that full configuration is complete: 0 = slave FPGA not configured, 1 = slave FPGA configured.
PR_DONE Input Indicates that the partial reconfiguration has finished.
Err_1(1:0) Input Two‐rail error signal from PR module one
Err_2(1:0) Input Two‐rail error signal from PR module two
Err_3(1:0) Input Two‐rail error signal from PR module three
Start_PR_out Output A request to the PR Controller to start partial reconfiguration.
Start_full Output A request to the Full Configuration Controller to start full reconfiguration.
Number_of_err(2:0) Output These signals are connected to the LEDs to indicate the number of errors that have happened since the most recent full configuration.
PR_Select_out(1:0) Output Two‐bit signal that indicates which portion of the slave is faulty and which bitstream should be sent to the slave side. It is sampled on the rising edge of start_PR_out.
Current_state(4:0) Output This output buffer is needed for one‐hot state encoding. For more details, please refer to [30].
Err_in_classifier(1:0) Output This two‐rail signal indicates the presence of an error in the fault‐classifier state machine.
The error signals inform the fault classifier that there is an error in the PR or static regions of the slave FPGA. Then, considering the type of error, the fault classifier performs the proper action. The FSM diagram is shown in Figure 11. For instance, if the error is in IRA_1, we will get a stream of “11” or “00” on the err_1 signals, and the fault classifier will go to the Init_PR state, which sends the PR request to the PR controller.
Figure 11 Fault Classifier finite state machine diagram
3.1.1.2 PR Controller
Partial reconfiguration can be done via JTAG, SelectMAP, Master‐Serial, or ICAP. In the proposed design, partial reconfiguration is done via the Internal Configuration Access Port (ICAP). ICAP is a comprehensive solution for PR design with regard to its capability of doing readback and verifying the design after reconfiguration. There are some status registers in ICAP which indicate an error if partial reconfiguration of a block has not succeeded. Furthermore, a CRC checker can be added on the slave side to check the CRC of the received file before forwarding it to the ICAP. By using these two techniques (monitoring the ICAP registers and CRC checking), we can be sure that the target FPGA is partially reconfigured correctly. In the current design, none of the above‐mentioned methods has been implemented yet; for now, we focus only on making the overall system work correctly. These features can be considered as future work. Figure 12 illustrates the PR Controller interface, and Table 4 describes the PR Controller pins.
Figure 12 PR controller interface
Table 4 PR Controller pin description
Pin Name Type Description
CLK Input Main clock for the PR Controller. It could be up to 100 MHz; we connect it to the 6.25 MHz DCM clock.
START_WRITE Input This signal is connected to Start_PR_out of the fault‐classifier module.
PR_Select(1:0) Input Two‐bit signal that indicates which portion of the slave is faulty and which bitstream should be sent to the slave side. It is sampled on the rising edge of start_PR_out of the fault‐classifier module.
PR_DONE Output Indicates that the partial reconfiguration has finished.
ICAP_CE Output Active‐Low ICAP interface select. Equivalent to CS_B in the SelectMAP interface.
ICAP_write Output ICAP data flow direction: 0 = WRITE, 1 = READ. Equivalent to the RDWR_B signal in the SelectMAP interface.
ICAP_CLK Output ICAP interface clock. The data are sampled on the rising edge of this clock.
Current_state(4:0) Output This is the output buffer needed for one‐hot state encoding. For more details, please refer to [30].
Err_in_PR(1:0) Output This two‐rail signal indicates the presence of an error in the PR controller state machine.
ICAP_INPUT_P(15:0) Output ICAP write data bus. The bus width depends on the ICAP_WIDTH parameter. The bit ordering is identical to the SelectMAP interface.
ICAP_INPUT_N(15:0) Output ICAP write data bus. The bus width depends on the ICAP_WIDTH parameter. The bit ordering is identical to the SelectMAP interface.
The FSM diagram of the PR controller is shown in Figure 13. The controller enters the first state after a rising edge on start_PR is received. To enable ICAP, we first assert ICAP_write and then, one clock cycle later, ICAP_CE. In the third state, ICAP_CLK is enabled and the process of sending the bitstream to the slave FPGA starts. At the rising edge of each clock, the RAM address counter is incremented and one 16‐bit word is sent to the slave FPGA. After one partial bitstream (12174 words in our case) has been sent completely, the FSM enters its final state and deactivates the ICAP primitive. Moreover, PR_DONE is asserted in the final state to inform the fault classifier about the completion of the partial reconfiguration. The ICAP clock can work correctly at frequencies up to 100 MHz. The controller has been tested and verified at different clock speeds from 6.25 MHz to 100 MHz.
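The PR controller sequence just described can be modelled cycle by cycle. In this sketch both ICAP strobes are active‐Low (0 = asserted), as in the pin tables; the state names, and the simplification that the first word is sent in the cycle right after ICAP_CE is asserted, are assumptions, since the real design is a one‐hot hardware FSM. The example words 0xAA99 and 0x5566 are simply the two halves of the standard Xilinx sync word.

```python
from typing import Iterator, List, NamedTuple, Optional


class Cycle(NamedTuple):
    state: str
    icap_write_n: int      # ICAP_write: 0 = WRITE, 1 = READ
    icap_ce_n: int         # ICAP_CE: 0 = ICAP selected
    data: Optional[int]    # 16-bit word on the ICAP input bus, if any
    pr_done: bool


def pr_sequence(bram: List[int], length: int) -> Iterator[Cycle]:
    """Yield one Cycle per ICAP clock for sending `length` words from `bram`."""
    yield Cycle("ASSERT_WRITE", 0, 1, None, False)  # write direction first
    yield Cycle("ASSERT_CE", 0, 0, None, False)     # CE one cycle later
    for address in range(length):                   # RAM address counter
        yield Cycle("SEND", 0, 0, bram[address] & 0xFFFF, False)
    yield Cycle("DONE", 1, 1, None, True)           # ICAP released, PR_DONE raised


if __name__ == "__main__":
    for cycle in pr_sequence([0xAA99, 0x5566], 2):
        print(cycle)
```

Under this model, the 12174‐word bitstream mentioned above takes 12174 + 3 clock cycles, roughly 122 µs at 100 MHz.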
Figure 13 PR controller finite state machine diagram
3.1.1.3 Full Configuration Controller
3.1.1.3.1 Implemented circuit for full‐configuration controller
In this design, a master‐serial configuration18 interface is implemented. The slave FPGA19 is responsible for the configuration clock. The dedicated configuration pins of the slave FPGA are routed via the CPLD on the slave side, the master FPGA, and finally the CPLD on the master side to the Platform Flash on the master side. We used the CPLDs to access the hardwired dedicated configuration pins for full configuration. The master FPGA can initiate a full configuration by lowering the dedicated PROGRAM_B pin of the slave FPGA. After PROGRAM_B is released, the slave FPGA enters its configuration mode and starts clocking the Platform Flash, receiving the bitstream data one bit per clock. After configuration is finished, the slave FPGA signals DONE to the full configuration controller to report the completion of the configuration procedure. A block diagram of the full configuration system is shown in Figure 14.
Figure 14 Complete block diagram [slave side: Virtex‐5 evaluation board with FPGA‐2 (target) and CPLD‐2 routing the full configuration interface to the dedicated FPGA‐2 configuration pins; master side: Virtex‐5 evaluation board with FPGA‐1 (master) running the full configuration controller, and CPLD‐1 routing the Platform Flash to FPGA‐1; full configuration of FPGA‐2 via master serial]
18 Information about Virtex‐5 configuration modes is available in Appendix D.
19 Here, the slave FPGA means the FPGA on the slave side.
3.1.1.3.2 The Full‐Configuration‐Controller architecture
Figure 15 shows the full‐configuration‐controller interface, and Table 5 gives the pin description.
Figure 15 Full Configuration Controller Interface
Table 5 Full configuration controller. Pin description
Pin Name Type Description
CLK Input Main clock for the full configuration controller. It could be up to 100 MHz; we connect it to the 6.25 MHz (CCLK) DCM clock. This clock drives the internal state machine and is different from the configuration clock.
RST Input Main reset. This pin is connected to the main reset push button.
START_config Input This signal is connected to Start_full of the fault‐classifier module. A rising edge on this signal indicates the start of full configuration.
RAM_DIN Input Serial configuration data input, synchronous to the rising RAM_CCLK edge. This pin is connected to D0 of the Platform Flash.
SLAVE_CCLK Input Configuration clock source for all configuration modes except JTAG. This signal is connected to the CCLK of the slave FPGA.
SLAVE_DONE Input Active‐High signal indicating that full configuration is complete: 0 = slave FPGA not configured, 1 = slave FPGA configured.
SLAVE_INIT_B Input Before the Mode pins are sampled, INIT_B is an input that can be held Low to delay the full configuration of the slave FPGA. After the Mode pins are sampled, INIT_B is an open‐drain, active‐Low output indicating whether a CRC error occurred during full or partial reconfiguration: 0 = CRC error, 1 = no CRC error.
RAM_CE_B Output Chip Enable output. When CE is High, the Platform Flash is put into low‐power standby mode, the address counter is reset, and the DATA pins are put in a high‐impedance state.
RAM_INIT_B Output Corresponds to OE/RESET_B of the Platform Flash. When Low, this pin holds the address counter in reset and the DATA output is in a high‐impedance state.
RAM_CCLK Output Synchronous clock for the Platform Flash. Data is put on RAM_DIN on the rising edge of this clock.
SLAVE_D0 Output Configuration DATA input pin for the slave FPGA.
SLAVE_PROG_B Output Active‐Low asynchronous full‐chip reset. This pin is connected to PROGRAM_B of the slave FPGA.
Current_state(4:0) Output This is the output buffer needed for one‐hot state encoding. For more details, please refer to [30].
Err_in_full(1:0) Output This two‐rail signal indicates the presence of an error in the full‐configuration‐controller state machine.
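The full‐configuration sequence of this controller (hold the slave's PROGRAM_B Low for at least 250 ns, release it, then wait for SLAVE_DONE) can be sketched as follows. The helper converts the 250 ns minimum into whole controller clock cycles with integer arithmetic; the polling loop and its `max_wait_cycles` timeout are illustrative assumptions, not part of the thesis design.

```python
from typing import Callable, Tuple

T_PROGRAM_MIN_NS = 250  # minimum PROGRAM_B Low time quoted in this section


def program_b_low_cycles(clock_hz: int) -> int:
    """Smallest whole number of clock cycles covering T_PROGRAM_MIN_NS."""
    return (T_PROGRAM_MIN_NS * clock_hz + 1_000_000_000 - 1) // 1_000_000_000


def run_full_configuration(clock_hz: int, slave_done: Callable[[], bool],
                           max_wait_cycles: int = 1_000_000) -> Tuple[str, int, int]:
    """Software model of the FSM: returns (final state, PROGRAM_B Low cycles,
    cycles spent waiting for SLAVE_DONE)."""
    low_cycles = program_b_low_cycles(clock_hz)  # phase 1: PROGRAM_B driven Low
    # Phase 2: PROGRAM_B released; the slave clocks in the bitstream itself,
    # and the controller polls SLAVE_DONE once per cycle.
    for waited in range(max_wait_cycles):
        if slave_done():
            return "DONE", low_cycles, waited
    return "TIMEOUT", low_cycles, max_wait_cycles


if __name__ == "__main__":
    print(program_b_low_cycles(6_250_000))    # 2  (160 ns period)
    print(program_b_low_cycles(100_000_000))  # 25 (10 ns period)
```

At the 6.25 MHz controller clock used here, two cycles are enough to satisfy the 250 ns minimum.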
The FSM diagram of the Full Configuration Controller is shown in Figure 16. In this controller, we control the PROGRAM_B pin of the slave FPGA. The other configuration pins on the slave side are connected directly20 to the interface of the Platform Flash. When the Start_full command is received from the fault classifier, the full configuration controller lowers PROGRAM_B for at least 250 ns21. Then, it releases PROGRAM_B and monitors the SLAVE_DONE signal; when DONE is received, the controller finishes its job by entering the done state.
Figure 16 Full Configuration Controller finite state machine
3.1.1.4 Digital Clock Manager (DCM)
The Digital Clock Manager (DCM) is a primitive in Xilinx FPGAs that can be used to implement a delay‐locked loop, a digital frequency synthesizer, a digital phase shifter, or a digital spread spectrum. In this design, a DCM is used to reduce the 100 MHz clock frequency. In fact, the implemented design does not need any DCM, because it can work with the 100 MHz clock directly; however, if slave serial configuration is used instead of master serial configuration, the maximum clock speed should be less than 20 MHz22, and in this case a DCM is needed. In practice, the maximum clock speed should not exceed 16 MHz23 when using slave serial mode.
3.1.1.5 Debounce module
A simple debounce module is implemented to debounce the input push buttons for the reset and start‐write signals. The debounce module for start‐write is not shown in Figure 9.
20 Indirectly, via two CPLDs and one FPGA.
21 This is the minimum required time for PROGRAM_B to remain asserted.
22 This is the maximum clock frequency of the Platform Flash.
23 This is the maximum clock frequency that we have reached for the XCF32P.
3.1.1.6 BitStream Module
Up to now, the partial bitstream files are stored in on‐chip BRAMs. These files should be protected against SEUs; therefore, in the final product, they should be moved to a radiation‐hardened memory. Since we do not have access to any rad‐hard memory in this project, we have used a simple I2C
memory to test our design. The new block diagram of the whole system with an external memory for
storing partial bit files are shown in Figure 17.
Figure 17 the implemented design with an external memory for storing partial bit‐stream files
In this design, the Bit Stream Module is responsible for refreshing the content of the BRAM every n minutes. This refresh interval can be changed based on the application and the environment in which the system is deployed. Another option for the Bit Stream Module is to send the data from the external memory to the PR controller directly. This option is suitable when the bitstream files are too large to be stored in on‐chip memory at the same time. However, in this case the interface speed of the external memory limits the partial reconfiguration speed, and we can no longer benefit from the 400 MB/s24 configuration speed. Since the bitstream files in our project are small enough, we kept the main idea of using BRAM and added a bit stream module to refresh the content of the BRAM periodically.
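The trade‐off above can be expressed as a simple bound: when partial bitstreams are streamed from external memory, the achievable reconfiguration rate is the slower of the ICAP rate and the memory interface rate. The sketch below is a minimal model under stated assumptions: it uses the 400 MB/s ICAP figure from the text, ignores command overhead, and the 10 MB/s external‐memory rate and the function name are hypothetical.

```python
from typing import Optional

# Maximum ICAP throughput quoted in the text (x32 interface at 100 MHz).
ICAP_X32_BYTES_PER_S = 400e6

def reconfig_time_s(bitstream_bytes: int,
                    memory_bytes_per_s: Optional[float] = None) -> float:
    """Hypothetical helper: lower bound on partial-reconfiguration time (seconds).

    memory_bytes_per_s=None models the BRAM-based design: the bitstream is
    already on chip, so only the ICAP rate matters. Otherwise the bitstream
    is streamed from external memory and the slower link sets the rate.
    """
    rate = ICAP_X32_BYTES_PER_S
    if memory_bytes_per_s is not None:
        rate = min(rate, memory_bytes_per_s)
    return bitstream_bytes / rate

size = 24_348  # size of one test partial bitstream reported in Chapter 5
print(f"from BRAM:           {reconfig_time_s(size) * 1e6:.1f} us")
print(f"streamed at 10 MB/s: {reconfig_time_s(size, 10e6) * 1e6:.1f} us")  # hypothetical rate
```

With these assumptions the BRAM path takes about 61 us, while the streamed path is limited entirely by the external interface, which is why keeping small bitstreams in BRAM preserves the full ICAP speed.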
3.1.2 Implemented design in the slave side
Figure 18 shows the block diagram of the design in the slave board. As previously mentioned, we
assume that the required system is implemented in the slave side. This part consists of partial
reconfiguration regions and a static part.
24 This is the maximum reachable ICAP speed
Figure 18 Implemented design ‐ slave side (block diagram of the Virtex‐5 evaluation board on the slave side: FPGA‐2 (target) with the static parts, ICAP, the partial reconfiguration interface, the full configuration path and the PR regions RM1 to RM3 with their error signals; CPLD‐2 routes the full configuration interface to the dedicated FPGA‐2 configuration pins)
3.1.2.1 Static Region
The static region contains the parts that cannot or should not be reconfigured. These items could be
ICAP_VIRTEX5, I/O buffers or DCMs.
3.1.2.1.1 ICAP_VIRTEX5 [31]
The ICAP_VIRTEX5 primitive works the same way as the SelectMAP configuration interface, except that it is on the fabric side and has separate read and write buses, as opposed to the bidirectional bus in SelectMAP. The general SelectMAP timing diagrams and the SelectMAP bitstream ordering information, as described in the “SelectMAP Configuration Interface” section of the Virtex‐5 FPGA Configuration User Guide [31], also apply to ICAP. ICAP allows the user to access configuration registers, read back configuration data, or partially reconfigure the FPGA after configuration is done. ICAP has three data width selections through the ICAP_WIDTH parameter: x8, x16, and x32. Virtex‐5 devices have two ICAP ports (top and bottom), which cannot be operated simultaneously. The design must start from the top ICAP, and can then switch back and forth between the two.
Pin Name Type Description
CLK Input ICAP interface clock
CE Input Active‐Low ICAP interface select. Equivalent to CS_B in the SelectMAP interface.
WRITE Input 0=WRITE, 1=READ. Equivalent to the RDWR_B signal in the SelectMAP interface.
I[31:0] Input ICAP write data bus. The bus width depends on ICAP_WIDTH parameter. The bit ordering is identical to the SelectMAP interface. See ICAP Data Ordering in [31]
O[31:0] Output Unregistered ICAP read data bus. The bus width depends on the ICAP_WIDTH parameter. The bit ordering is identical to the SelectMAP interface.
BUSY Output Active‐High busy status. Only used in read operations. BUSY remains Low during writes.
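To make the write protocol in the table concrete, the sketch below lists, clock by clock, the values a controller would drive on CE, WRITE and I during a write burst. It is a simplified behavioral model written for illustration, not the controller's VHDL: the helper name and tuple layout are hypothetical, BUSY is not modelled (the table notes it only matters for reads), and the detailed read/write switching rules of [31] are not covered.

```python
from typing import Iterable, List, Tuple

# One entry per rising edge of CLK: (CE, WRITE, I).
# CE is active-low and WRITE = 0 selects a write, as in the pin table above.
Cycle = Tuple[int, int, int]

def icap_write_cycles(words: Iterable[int], width: int = 32) -> List[Cycle]:
    """Hypothetical helper: signal values for writing `words` through ICAP.

    Simplified model: WRITE is set low while the port is still deselected,
    CE is then held low for one word per clock, and the port is deselected
    again afterwards.
    """
    if width not in (8, 16, 32):
        raise ValueError("ICAP_WIDTH must be 8, 16 or 32")
    mask = (1 << width) - 1
    cycles: List[Cycle] = [(1, 0, 0)]       # deselected, WRITE already low
    for w in words:
        if w < 0 or w > mask:
            raise ValueError(f"word {w:#x} does not fit in {width} bits")
        cycles.append((0, 0, w))            # selected, write, data on I
    cycles.append((1, 0, 0))                # deselect after the last word
    return cycles

for ce, write, data in icap_write_cycles([0xAA995566, 0x20000000]):
    print(ce, write, f"{data:#010x}")
```

The example burst writes the Xilinx sync word 0xAA995566 followed by a NOOP word (0x20000000); a real partial bitstream is simply a longer list of such words.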
3.1.2.1.2 ICAP Data Ordering [31]
In many cases, ICAP configuration is driven by a user application residing on a microprocessor, CPLD, or in some cases another FPGA. In these applications, it is important to understand how the data ordering in the configuration data file corresponds to the data ordering expected by the FPGA. In ICAP x8 mode, configuration data is loaded at one byte per CCLK, with the MSB of each byte presented to the D0 pin. This convention (D0 = MSB, D7 = LSB) differs from many other devices. This convention can be a source of confusion when designing custom configuration solutions. Table 6 shows how to load the hexadecimal value 0xABCD into the ICAP data bus.
Table 6 Bit Ordering for ICAP 8‐Bit Mode
CCLK Cycle | Hex Equivalent | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7
1 | 0xAB | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1
2 | 0xCD | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1
Some applications can accommodate the non‐conventional data ordering without difficulty. For other applications, it can be more convenient for the source configuration‐data file to be bit‐swapped, meaning that the bits in each byte of the data stream are reversed. For these applications, the Xilinx PROM file generation software can generate bit‐swapped PROM files. Table 7 shows the bit ordering for x8, x16, and x32 modes.
Table 7 Bit Ordering
Pin 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
x32 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7
x16 (pins 15 to 0) 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7
x8 (pins 7 to 0) 0 1 2 3 4 5 6 7
3.1.2.1.3 IBUFDS: Differential Input Buffer Primitive
In order to be able to use the ICAP maximum speed, we need to use differential pairs to connect the ICAP data bus between two evaluation boards. To use these pairs, we need to utilize differential IO buffers. “The usage and rules corresponding to the differential primitives are similar to the single‐ended SelectIO primitives. Differential SelectIO primitives have two pins to and from the device pads to show the P and N channel pins in a differential pair. N channel pins have a “B” suffix.” [32] Figure 19 shows the differential input buffer primitive.
Figure 19 Differential Input Buffer Primitive (IBUFDS)
25 D[0:7] represent the ICAP DATA pins.
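When the configuration data file is not already bit‐swapped, the per‐byte reversal described in Section 3.1.2.1.2 (and visible in the x32 row of Table 7, where pin 31 carries bit 24 and pin 24 carries bit 31) can be applied in software before the data is written to ICAP. A minimal Python sketch; the function names are hypothetical and the code only reverses the bits inside each byte, leaving the byte order of the word unchanged.

```python
def swap_bits_in_byte(b: int) -> int:
    """Reverse the bit order of one byte (bit 7 <-> bit 0, bit 6 <-> bit 1, ...)."""
    r = 0
    for _ in range(8):
        r = (r << 1) | (b & 1)   # move the current LSB of b into r
        b >>= 1
    return r

def swap_bits_in_word(word: int, width: int = 32) -> int:
    """Bit-swap every byte of a `width`-bit word; the byte order is unchanged."""
    out = 0
    for i in range(width // 8):
        byte = (word >> (8 * i)) & 0xFF
        out |= swap_bits_in_byte(byte) << (8 * i)
    return out

print(f"{swap_bits_in_word(0xAA995566):#010x}")
```

For example, the sync word 0xAA995566 becomes 0x5599aa66, and 0xAB from Table 6 becomes 0xD5. Applying the swap twice returns the original word, so the same function converts in both directions.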
3.1.2.2 Partial Reconfiguration Regions
An implemented system on an FPGA should be divided into partial reconfiguration regions (PRRs). Partial reconfigurable modules (PRMs) are the parts of the design that can be placed in the PR regions. A user may have any number of PRMs for a PRR; however, only one PRM can operate in a PRR at a given time. The minimal size of a PRM is theoretically one CLB26; however, due to the structure of the configuration memory, the configuration of a CLB is contained in several frames, and each frame contains the configuration bits of 20 CLBs. Since the frame is the smallest part of the FPGA that can be configured, every reconfiguration changes at least 20 CLBs [12]. The size of the PRMs is important for the optimality of the performance. The authors of [8] proposed a reliability‐aware solution for selecting an optimal area for PRMs.
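The frame granularity described above means that the area actually rewritten by a partial bitstream is larger than the PRM whenever the PRM height is not a multiple of 20 CLBs. The sketch below shows that arithmetic under two stated assumptions: one frame covers 20 CLBs in a CLB column (the figure given above), and the PRM is aligned to frame boundaries. The function name is hypothetical.

```python
import math

CLBS_PER_FRAME = 20  # CLBs covered by one configuration frame (from the text)

def clbs_rewritten(prm_height_clbs: int, prm_width_clbs: int) -> int:
    """Hypothetical helper: CLBs rewritten when a PRM of this size is reconfigured.

    Assumes the PRM is aligned to frame boundaries, so in every CLB column it
    occupies, its height is rounded up to the next multiple of CLBS_PER_FRAME.
    """
    if prm_height_clbs < 1 or prm_width_clbs < 1:
        raise ValueError("PRM dimensions must be positive")
    rows = math.ceil(prm_height_clbs / CLBS_PER_FRAME) * CLBS_PER_FRAME
    return rows * prm_width_clbs

print(clbs_rewritten(1, 1))   # smallest possible PRM
print(clbs_rewritten(8, 8))   # 8x8-CLB PRM, as used in Chapter 5
```

Under these assumptions, even a one‐CLB PRM rewrites 20 CLBs, and an 8x8‐CLB PRM rewrites 160 CLBs, which is why PRM shape as well as area affects reconfiguration cost.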
Specific interfaces must be inserted at the borders of the PRRs. These interfaces are called proxy logic in the ISE design tools. The user can place proxy logic manually, or it can be placed automatically by the design tool. In recent ISE design tools, proxy logic is also supported by the timing analysis; therefore, it is possible to analyze the critical path between the static region and the PR regions.
The design flow can be carried out in ISE and PlanAhead. First, ISE synthesizes the VHDL or Verilog code and generates the necessary netlist files. Next, these files are imported into PlanAhead. Finally, after floorplanning in PlanAhead, partial bitstream files (*.bit) are generated for each PRM. These bitstream files have a header that contains the address of a PRM. To reconfigure a PRM, the relevant partial bitstream file should be forwarded to the configuration engine through one of the available interfaces27.
3.2 Summary
In this chapter, we described our proposed configuration controller in detail. The implemented configuration controller and its characteristics were presented. In the next chapter, we will discuss design hardening techniques for our proposed controller.
26 Configurable Logic Block
27 JTAG, SelectMAP, Master‐Serial or ICAP
4 Design Hardening
Up to now, all components of the system implemented in our multi‐FPGA platform are hardened by a combination of hardware‐redundancy techniques and partial reconfiguration capability for fault detection, masking and recovery. In our work, each component is triplicated and each copy is placed in its own PR region. By comparing the output of each copy with the others, the voter can detect and mask an error in a PR region and inform the reconfiguration controller for recovery. However, three more parts of the system still need to be protected against SEUs.
In this chapter, we will discuss the strategy for hardening the Configuration Controller, the interfaces
and the bitstream. As previously discussed, the most robust mitigation strategy is to use redundancy
techniques coupled with partial reconfiguration property. In this design, three different approaches
have been utilized to increase the overall reliability.
4.1 State Machine Encoding
Because the implemented circuits for PR_Controller, full_config_controller, fault_classifier and bit_stream_module are based on finite state machines (FSMs), the first step to increase the robustness of the design is to choose a suitable encoding for these FSMs. Many works have addressed optimal state encoding [30], [33], and many tools and techniques are available to apply it. Most of them aim to minimize the number of bits required for state encoding. A poor choice of encoding technique results in a state machine that is very costly in terms of resource utilization, very slow, or both. Moreover, the encoding must be specified in the hardware description language itself to ensure the reliability of the protected FSM.
In this project, an optimized one‐hot state encoding has been embedded into hardware description
language of the state machines. In the one‐hot state encoding, only one bit of the state vector is set to
one for any given state and all other state bits remain zero. Thus if there are n states then n state flops
are required. State decode is simplified, since the state bits themselves can be used directly to indicate
whether the machine is in a particular state. No additional logic is required [30]. We have used one‐hot
state encoding because it has the following advantages:
• It maps easily into the Xilinx register‐based FPGA architecture, and it is easy to apply one‐hot state encoding to a state machine. Schematics can be captured and HDL code can be written directly from the state diagram without coding a state table. [30]
• One‐hot state encoding is typically faster than other state encoding techniques. Moreover, speed is independent of the number of states, and instead depends only on the number of transitions into a particular state. [30]
• It is very easy to modify the design without manipulating the rest of the machine.
• It can be easily synthesized from VHDL or Verilog, and it is possible to find the critical path using static timing analysis. [30]
Xilinx can apply one‐hot state encoding when synthesizing the circuit; however, it is not possible to use this property, since in this case error detection is not possible. Therefore, we have applied the one‐hot state encoding directly in the FSM VHDL code.
The error detection is quite easy in this scenario. It is only necessary to check the state bits concurrently. If there is more than one bit asserted in a given state vector, there is an error.
4.2 Internal Signal Hardening
In addition to state machines, there are some internal signals between components which should be hardened. An undesired value on these signals may start a component unexpectedly. Examples of such signals are stat_PR and start_full_config, which are used by fault_classifier to inform the partial reconfiguration controller and the full configuration controller to start their functions, respectively. If there is an undesired upset on these signals, or if they get stuck at zero or one, the functionality of the whole design may be corrupted. To prevent this situation, we have used 2‐rail logic to encode them. In this simple technique, each signal is represented by two bits: “10” represents ‘1’ and “01” represents ‘0’, while “00” and “11” are not valid values. An error signal is generated whenever “00” or “11” occurs. Moreover, the error detection is also very simple and could be implemented by an XOR gate.
4.3 Interface Hardening
The next step is to harden the connection signals between two evaluation boards (Figure 20) or two FPGAs in a multi‐FPGA platform.
Figure 20 the connection between two evaluation boards
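The concurrent state‐vector check of Section 4.1 can be modelled behaviorally as follows. This is a Python sketch, not the thesis' VHDL, and the state names are hypothetical. Besides the “more than one bit asserted” condition given in the text, the sketch also flags an all‐zero vector, which is likewise not a valid one‐hot code; with both checks, any single‐bit upset of a valid state vector is detected, because flipping the set bit gives zero and flipping any other bit gives two set bits.

```python
from enum import IntEnum

class State(IntEnum):
    # Hypothetical states of a configuration FSM; each owns one flip-flop.
    IDLE = 0
    LOAD = 1
    WRITE = 2
    DONE = 3

def encode_one_hot(state: State) -> int:
    """One-hot code: only the bit belonging to `state` is set."""
    return 1 << state

def one_hot_error(vector: int) -> bool:
    """True if `vector` is not a valid one-hot code (no bit or several bits set)."""
    return vector == 0 or (vector & (vector - 1)) != 0

def decode_one_hot(vector: int) -> State:
    """Return the state encoded by `vector`, or raise on an invalid code."""
    if one_hot_error(vector):
        raise ValueError(f"invalid one-hot state vector {vector:04b}")
    return State(vector.bit_length() - 1)

print(one_hot_error(encode_one_hot(State.WRITE)))                 # valid vector
print(one_hot_error(encode_one_hot(State.WRITE) ^ 0b0010))        # single-bit upset
```

In hardware the same test is a small combinational function of the state flip‐flops, evaluated every cycle alongside the FSM.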
These signals are susceptible to faults caused mainly by radiation or electromagnetic interference. Since the ICAP data bus is 16 or 32 bits wide and works at a 100 MHz clock frequency, crosstalk is also possible. We have used differential pairs to prevent such phenomena. The principle of differential pairs is much the same as that of 2‐rail logic: “10” represents ‘1’, “01” represents ‘0’, and “00” and “11” are not valid values.
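The 2‐rail convention used for the internal signals of Section 4.2 and for the differential pairs (“10” for ‘1’, “01” for ‘0’) can be modelled as below. A behavioral Python sketch with hypothetical function names: a code is valid exactly when the XOR of the two rails is 1, so the error flag is the complement of that XOR, which is the single‐gate check mentioned in Section 4.2.

```python
from typing import Tuple

def encode_2rail(bit: int) -> Tuple[int, int]:
    """Map a logic value to two rails: 1 -> (1, 0), 0 -> (0, 1)."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (1, 0) if bit else (0, 1)

def rail_error(rails: Tuple[int, int]) -> bool:
    """'00' and '11' are invalid: the two rails must differ (XOR == 1)."""
    p, n = rails
    return (p ^ n) == 0

def decode_2rail(rails: Tuple[int, int]) -> int:
    """Return the logic value carried by the rails, or raise on an invalid code."""
    if rail_error(rails):
        raise ValueError(f"invalid 2-rail code {rails[0]}{rails[1]}")
    return rails[0]

print(decode_2rail(encode_2rail(1)), rail_error((1, 1)))
```

Because a valid code always has one rail high and one low, an upset or a stuck‐at fault on either rail of a valid code turns it into “00” or “11”, which the check reports.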
4.4 Bitstream Memory Protection
In this study, the bitstreams needed for reconfiguring the PR regions and the whole FPGA are generated with the tool chain and stored in an external non‐volatile flash memory. These flash memories are susceptible to SEUs [34]. Since the reliability of the whole system depends on the correctness of these original bitstreams, we need to protect them against radiation. In this work, we envisioned a solution that protects the original bitstreams by using a radiation‐hardened memory. However, this is not the only solution: another possibility is to use error control codes [35], [36], [37].
In particular, the author of [35] presented encoders and decoders of error control codes for semiconductor memory systems used in the space radiation environment. In that work, widely‐used error control codes, such as Hamming and Reed‐Solomon (RS) codes, are compared with a new class of byte error control codes suitable for semiconductor memory systems, called spotty byte error control codes. The author concluded that the spotty byte error control codes perform better in terms of gate count and maximum clock frequency. With this technique, we could use regular non‐volatile memories while still protecting the original bitstreams against corruption.
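To illustrate how an error control code lets an ordinary memory tolerate an upset, the sketch below implements a textbook Hamming(7,4) single‐error‐correcting code in Python. It is a generic example of the Hamming codes mentioned above, not the spotty byte error control codes of [35]; a real bitstream store would apply a (wider) code to each memory word, and the function names are hypothetical.

```python
from typing import List, Tuple

def hamming74_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]      # d[0] is the LSB of the data
    code = [0] * 8                                   # index 0 unused
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]            # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]            # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]            # covers positions 4, 5, 6, 7
    return code[1:]

def hamming74_decode(bits: List[int]) -> Tuple[int, int]:
    """Return (data, error_position); error_position is 0 if no error was found."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                          # XOR of positions of set bits
    if syndrome:
        code[syndrome] ^= 1                          # correct the single flipped bit
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, syndrome

word = hamming74_encode(0b1011)
word[4] ^= 1                                         # simulate an upset at position 5
print(hamming74_decode(word))
```

The syndrome directly names the position of a single flipped bit, so the stored data can be restored before it is used for reconfiguration; two or more upsets in the same codeword are beyond this code.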
5 Test Results
The design has been tested on two identical XUPV5‐LX110T evaluation boards (Figure 6). As previously mentioned, the controller is able to reconfigure the faulty part of the slave FPGA based on the received error signals. Error detection in the slave FPGA is the responsibility of the user; however, regardless of the error detection method used on the slave side, we assume that the error signals follow the 2‐rail checking rules28. To verify the correct functionality of the configuration controller, three PR regions have been created on the slave side. Each PR region contains two PR modules. One of them generates a string of “10” and “01”, which represents that the corresponding PR region works correctly; the other module generates a string of “11” and “00”, which represents that there is an error in the corresponding PR region. Chapter 4 in [15] describes the design steps involved when using the PlanAhead software for Partial Reconfiguration designs. Figure 21 shows the location of these PR regions after floorplanning in PlanAhead.
Figure 21 generated PR regions on the FPGA fabric
The PR modules are implemented in an area of 8x8 CLBs. For each PR module, one test bitstream file has been generated. These bitstream files have a size of 24348 bytes each. For each PR region, the bitstream file that corresponds to the correct behavior is stored in the on‐chip BRAMs for the partial reconfiguration process.
28 “00” and “11” on error signals indicate an error in the corresponding component.
The other three bitstreams, which are not stored in the BRAM, are used to simulate faults in the PR regions. These partial bitstreams are downloaded to the FPGA via JTAG using the iMPACT tool. After downloading, the corresponding PR regions start sending error signals to the configuration controller implemented on the master side. The master side then responds to the error signals by reconfiguring the corresponding PR region with a correct PR module.
The process above has been repeated 100 times, for each PR region and with different numbers of CLBs. The configuration controller was able to correct all of the simulated faults by performing partial reconfiguration or full configuration. Table 8 shows the device utilization summary for the configuration controller on the master side. The resource utilization of the bitstream module is not included in this summary.
Table 8 device utilization summary for the configuration controller (excluding the bitstream module)
Slice Logic Utilization Used Available Utilization
Number of Slice Registers 164 69,120 1%
Number used as Flip Flops 164
Number of Slice LUTs 239 69,120 1%
Number used as logic 236 69,120 1%
Number using O6 output only 114
Number using O5 output only 56
Number using O5 and O6 66
Number used as exclusive route‐thru 3
Number of route‐thrus 59
Number using O6 output only 59
Number of occupied Slices 118 17,280 1%
Number of LUT Flip Flop pairs used 259
Number with an unused Flip Flop 95 259 36%
Number with an unused LUT 20 259 7%
Number of fully used LUT‐FF pairs 144 259 55%
Number of unique control sets 18
Number of slice register sites lost to control set restrictions 20 69,120 1%
Number of bonded IOBs 61 640 9%
Number of LOCed IOBs 61 61 100%
IOB Master Pads 16
IOB Slave Pads 16
Number of Block RAM/FIFO 18 148 12%
Number using Block RAM only 18
Number of 36k Block RAM used 16
Number of 18k Block RAM used 3
Total Memory used (KB) 630 5,328 11%
Number of BUFG/BUFGCTRLs 3 32 9%
Number used as BUFGs 3
Number of DCM_ADVs 1 12 8%
Average Fan‐out of Non‐Clock Nets 3.72
Our implemented generic controller shows better performance in terms of speed compared to other generic controllers and software‐based controllers. Table 9 compares our design with another generic reconfiguration controller proposed by Ali Ebrahim in [26] and with a software‐based controller built on the Xilinx XPS_HWICAP engine presented in [7].
Table 9 configuration times (us) for different partial bitstreams
Partial Bitstream Size (KB) | XPS_HWICAP (x32) [7] | BRAM HWICAP (x32) [7] | ICAP controller (x32) [26] | Our ICAP Controller (x16) | Our ICAP Controller (x32)
7.7 | 533 | 28.0 | 21.7 | 39.5 | 19.7
23.2 | 1600 | 66.3 | 62.6 | 118.8 | 59.4
47.2 | 3300 | 121.7 | 124.1 | 241.7 | 120.9
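The last two columns of Table 9 can be cross‐checked against the ideal throughput of the ICAP port: at the 100 MHz clock used in this design, an x32 interface transfers 4 bytes per cycle (the 400 MB/s quoted in Section 3.1) and an x16 interface 2 bytes per cycle. The sketch below assumes 1 KB = 1024 bytes and ignores command overhead; under these assumptions it reproduces every value reported for our controller to within 0.1 us, suggesting the controller runs close to the port's limit.

```python
ICAP_CLOCK_HZ = 100e6  # ICAP clock used in this design

def icap_time_us(size_kb: float, width_bits: int, clock_hz: float = ICAP_CLOCK_HZ) -> float:
    """Ideal configuration time: one `width_bits` word per clock, no overhead."""
    size_bytes = size_kb * 1024               # assumption: 1 KB = 1024 bytes
    bytes_per_s = (width_bits // 8) * clock_hz
    return size_bytes / bytes_per_s * 1e6

# Measured values for our controller, copied from Table 9: size -> (x16, x32).
TABLE9 = {7.7: (39.5, 19.7), 23.2: (118.8, 59.4), 47.2: (241.7, 120.9)}

for size, (x16_us, x32_us) in TABLE9.items():
    print(f"{size:5.1f} KB  x16 ideal {icap_time_us(size, 16):6.1f} us (table {x16_us})"
          f"  x32 ideal {icap_time_us(size, 32):6.1f} us (table {x32_us})")
```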
In addition, our proposed configuration controller shows better results in terms of resource utilization compared to the design proposed in [26]. Ali Ebrahim [26] implemented his controller using 609 LUTs, whereas our design uses only 239 LUTs (Table 10), about 61% fewer29.
Table 10 Resource utilization of ICAP controller
Resources | XPS_HWICAP (x32) [7] | BRAM HWICAP (x32) [7] | ICAP controller (x32) [26] | Our ICAP Controller (x32)
LUTs (total) | 3275 | 963 | 609 | 239
29 The external memory interfaces are excluded in both designs to calculate resource utilization.
6 Conclusion and Future Works
The research presented in this thesis has proposed a dependable reconfiguration controller with the aim of recovering the faulty portion of the FPGA in a multi‐FPGA platform. Our working scenario was harsh environments, such as those of mission‐critical and safety‐critical applications, where electronic devices are susceptible to SEUs caused by ionizing radiation. In this thesis, different types of fault tolerant techniques for FPGA‐based systems are discussed. It was concluded that the best fault tolerant technique for an FPGA‐based design is to use redundancy techniques for fault detection and fault containment, and then use a recovery technique based on Xilinx partial reconfiguration to mitigate the fault. The main innovative contributions of this thesis are summarized as follows:
• The controller is implemented purely in hardware. This generic implementation not only increases performance in terms of higher speed and lower resource utilization, but also allows the designer to apply any available FT techniques to increase the reliability of the controller.
• The configuration interfaces for full configuration and partial reconfiguration are completely separated from each other. Full configuration is done via a serial interface, whereas partial reconfiguration is done via the parallel ICAP interface. This separation increases the overall reliability because, if partial reconfiguration stops functioning for any reason, there is still a configuration path for recovering the FPGA, albeit at a lower speed.
Directions for future work are summarized below:
1‐ A comprehensive testing solution: The testing method used in this thesis is based on pre‐built bitstream files, which simulate an SEU in the corresponding module. This method is not thorough enough. There are two other recognized testing methods that should be considered as future work: radiation testing and fault‐injection campaigns. Although radiation testing remains one of the worldwide‐recognized and complete methods for SEU analysis, radiation may permanently damage the device under test (DUT), and it increases the testing cost, both for developing the radiation setup and for the beam time. Moreover, there is no control over where the beam hits, so an SEU may occur in an undesired bit. Another solution is to inject faults during the programming phase to emulate SEUs in the FPGA; however, this method requires a huge amount of time to provide consistent results. A better solution could be to calculate the critical bits and then perform fault injection based on these bits. One example of such an approach is presented in [38].
2‐ Problem with synchronization of PRMs: The synchronization of a newly reconfigured module with other modules in the FPGA, or with other FPGAs in a multi‐FPGA platform, is another issue that has not been addressed in this thesis. This is the step that follows fault recovery: the newly reconfigured module must start operating from a correct state. One solution to this problem is presented in [12].
7 Glossary
Some common terms used throughout this document are defined in this section. These definitions are taken from [39].
Device: A single integrated circuit.
Failure: An unrecoverable error.
Functional Error: A logic error in the user function.
Functional Interrupt: A disruption in device operation requiring system level intervention to regain
normal functionality. Typically causes the loss of user or system data.
Multiple‐Bit Upset (MBU): An SEU that results in more than one adjacent bit flipping due to an oblique angle strike. MBU probability steadily increases as geometries shrink. The maximum observed MBU distance is useful to determine the block RAM interleaving required so that even MBUs can be corrected by the ECC.
Single‐Bit Upset (SBU): Same as SEU.
Scrubbing: The process of correcting any configuration cell upsets through FPGA partial reconfiguration.
Scrubbing does not interrupt user design function.
Single‐Event Effect (SEE): The resulting electrical disturbances caused by the direct ionization of a silicon
lattice by an energetic charged subatomic particle.
Single‐Event Functional Interrupt (SEFI): An SEE that results in the interference of the normal operation
of a complex digital circuit. SEFI is typically used to indicate a failure in a support circuit, such as loss of
configuration capability, power on reset, JTAG functionality, a region of configuration memory, or the
entire configuration.
Single‐Event Transient (SET): A signal transition caused by an SEE. Often observed as a glitch.
Single‐Event Upset (SEU): A state change (or flip) of a single data bit storage or memory cell caused by
an SEE. An SEU can affect the configuration memory cell states, the block RAM contents, a CLB DFF, a
LUTRAM, or SRL16 memory cell (which are also configuration memory cells, directly accessible to the
user).
System: An integration of multiple devices and circuit boards or modular sub‐systems.
User Function: User‐specified operational functions defined by the data stored in device configuration
memory.
8 Works Cited
[1] M. Caffrey, "A Space Based Reconfigurable Radio," Military and Aerospace Applications of Programmable Logic
Devices (MAPLD), Laurel MD, USA, 2002.
[2] A. Dawood, S. Visser and J. Williams, "Reconfigurable FPGAS for real time image processing in space," in 14th
International Conference on Digital Signal Processing, DSP2002, Santorini, Greece, 2002.
[3] D. M. Hiemstra, G. Battiston and P. Gill, "Single Event Upset Characterization of the Virtex‐5 Field Programmable
Gate Array Using Proton Irradiation," in IEEE Radiation Effects Data Workshop (REDW), Denver, CO, 2010.
[4] M. Ceschia, M. Menichelli, A. Papi, J. Wyss and A. Paccagnella, "Ion beam testing of SRAM‐based FPGA's," in
Radiation and Its Effects on Components and Systems, 2001. 6th European Conference on, 2001.
[5] E. Fuller, P. Blain, M. Caffrey and C. Carmichael, "Radiation Test Results of the Virtex FPGA and ZBT SRAM for
Space Based Reconfigurable Computing," Xilinx Inc., Los Alamos National Laboratory, 1999.
[6] L. Sterpone, M. Aguirre, J. Tombs and H. Guzmán‐Miran, "On the design of tunable fault tolerant circuits on
SRAM‐based FPGAs for safety critical applications," in Design automation and test in Europe, Torino, Sevilla,
2008.
[7] M. Liu, W. Kuehn, Z. Lu and A. Jantsch, "Run‐Time Partial Reconfiguration Speed Investigation and Architectural
Design Space Exploration," in FPL, Giessen, Germany, 2009.
[8] C. Bolchini, A. Miele and C. Sandionigi, "A Novel Design Methodology for Implementing Reliability‐Aware Systems
on SRAM‐Based FPGAs," IEEE Transactions on Computers, vol. 60, no. 12, pp. 1744 ‐ 1758, 2011.
[9] J. Hussein and G. Swift, "Mitigating Single‐Event Upsets," Xilinx, 2012.
[10] "FPGA vs. ASIC," Xilinx Inc., 2012. [Online]. Available: http://www.xilinx.com/fpga/asic.htm.
[11] I. Kuon and J. Rose, "Measuring the Gap between FPGAs and ASICs," in FPGA’06, Toronto, 2006.
[12] M. Straka, J. Kastil and Z. Kotasek, "Fault Tolerant Structure for SRAM‐based FPGA via Partial Dynamic
Reconfiguration," Digital System Design: Architecture, Methods and Tools, pp. 365‐372, 2010.
[13] F. Lima, C. Carmichael, J. J. Fabula and R. Padovani, "A fault injection analysis of Virtex FPGA TMR design
methodology," in Radiation and Its Effects on Components and Systems, 2001. 6th European Conference on,
2001.
[14] K. S. Morgan, D. L. McMurtrey, B. H. Pratt and M. J. Wirthlin, "A Comparison of TMR With Alternative Fault‐
Tolerant Design Techniques for FPGAs," Nuclear Science, IEEE Transactions on, vol. 54, no. 6, pp. 2065 ‐ 2072,
2007.
[15] "Partial Reconfiguration User Guide," Xilinx Inc., 2011.
[16] L. Ming, W. Kuehn, L. Zhonghai and A. Jantsch, "Run‐Time Partial Reconfiguration Speed Investigation and
Architectural Design Space Exploration," in International Conference on Field Programmable Logic and
Applications, FPL 2009, Prague, 2009.
[17] M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K. LaBel, M. Friendlich, H. Kim and A. Phan, "Effectiveness of
Internal Versus External SEU Scrubbing Mitigation Strategies in a Xilinx FPGA: Design, Test, and Analysis," IEEE
Transactions on Nuclear Science, vol. 55, no. 4, pp. 2259 ‐ 2266, 2008.
[18] K. Chapman, "SEU Strategies for Virtex‐5 Devices (XAPP864)," Xilinx Inc., 2010.
[19] A. Sari and M. Psarakis, "Scrubbing‐based SEU Mitigation Approach for Systems‐on‐Programmable‐Chips," in
International Conference on Field‐Programmable Technology (FPT), New Delhi, 2011.
[20] M. Niknahad, O. Sander and J. Becker, "Fine grain fault tolerance ‐ A key to high reliability for FPGAs in space," in
IEEE Aerospace Conference, Big Sky, MT, 2012.
[21] K. Kyriakoulakos and D. Pnevmatikatos, "A novel SRAM‐based FPGA architecture for efficient TMR fault
tolerance support," in International Conference on Field Programmable Logic and Applications, FPL, Prague,
2009.
[22] C. Carmichael, "Triple Module Redundancy Design Techniques for Virtex FPGAs (XAPP197)," Xilinx Inc., 2006.
[23] "TMRTool," Xilinx Inc., [Online]. Available: http://www.xilinx.com/ise/optional_prod/tmrtool.htm. [Accessed
2013].
[24] F. de Lima Kastensmidt, "Designing fault‐tolerant techniques for SRAM‐based FPGAs," vol. 21, no. 6, pp. 552‐
562, 2004.
[25] C. Bolchini, L. Fossati, D. Codinachs, A. Miele and C. Sandionigi, "A reliable reconfiguration controller for fault‐
tolerant embedded systems on multi‐FPGA platform," in IEEE 25th International Symposium on Defect and Fault
Tolerance in VLSI Systems (DFT), Kyoto, 2010.
[26] A. Ebrahim, K. Benkrid, X. Iturbe and C. Hong, "A Novel High‐Performance Fault‐Tolerant ICAP Controller,"
Edinburgh.
[27] Y. Shu‐Yi and E. J. McCluskey, "On‐line Testing and Recovery in TMR Systems for Real‐Time Applications," in ITC
INTERNATIONAL TEST CONFERENCE, Stanford University, Stanford, California, 2001.
[28] D. Nikolos, "Self‐Testing Embedded Two‐Rail Checkers," Journal of Electronic Testing: Theory and Applications ‐
Special issue on On‐line testing, vol. 12, no. 1 ‐ 2, pp. 69 ‐ 79, 1998.
[29] M. Omana, D. Rossi and C. Metra, "High Speed and Highly Testable Parallel Two‐Rail Code Checker," in Design,
Automation and Test in Europe Conference and Exhibition, 2003.
[30] S. Golson, "One‐hot state machine design for FPGAs," in 3rd PLD Design Conference, Santa Clara CA, 1993.
[31] "Virtex‐5 FPGA Configuration User Guide," Xilinx Inc., 2011.
[32] "Virtex‐5 FPGA User Guide (UG190)," Xilinx Inc., 2012.
[33] M. Cassel and F. Lima, "Evaluating one‐hot encoding finite state machines for SEU reliability in SRAM‐based
FPGAs," in 12th IEEE International On‐Line Testing Symposium, IOLTS, Lake Como, 2006.
[34] D. Nguyen, S. Guertin and J. Patterson, "Radiation Tests on 2Gb NAND Flash Memories," in IEEE Radiation
Effects Data Workshop, Ponte Vedra, FL, 2006.
[35] H. Kaneko, "Error Control Coding for Semiconductor Memory Systems in the Space Radiation Environment," in
20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, DFT, 2005.
[36] G. Umanesan and E. Fujiwara, "A class of random multiple bits in a byte error correcting (Stb/EC) codes for
semiconductor memory systems," in Pacific Rim International Symposium on Dependable Computing, 2002.
Proceedings, 2002.
[37] G. Umanesan and E. Fujiwara, "A class of systematic t/B‐error correcting codes for semiconductor memory
systems," in Information Theory Workshop, IEEE Proceedings., Cairns, Qld., 2001.
[38] L. Sterpone, F. Margaglia, M. Koester, J. Hagemeyer and M. Porrmann, "Analysis of SEU Effects in Partially
Reconfigurable SoPCs," in Adaptive Hardware and Systems (AHS), 2011 NASA/ESA, San Diego, CA, 2011.
[39] B. Bridgford, C. Carmichael and W. Tseng, "Single‐Event Upset Mitigation Selection Guide," Xilinx, 2008.
[40] "Early Access Partial Reconfiguration User Guide," 2006.
[41] "LogiCORE™ IP Soft Error Mitigation Controller v2.1," Xilinx Inc., 2011.
[42] E. Dubrova, Fault Tolerant Design: An Introduction, Stockholm: Kluwer Academic Publishers, 2008, p.
147.
[43] "Virtex‐5 Family Overview," Xilinx, 2009.
[44] "hkinventory," Xilinx, 2012. [Online]. Available: http://www.hkinventory.com/public/ECatalogResultProductDetails.asp?CompanyID=104266&ProductID=27031.
[45] S. Suhail Zain and C. Hu, "NSEU Mitigation in Avionics Applications," 2010.
[46] "Xilinx University Program XUPV5‐LX110T Development System," Xilinx Inc., 2012. [Online]. Available:
http://www.xilinx.com/univ/xupv5‐lx110t.htm.
[47] "Soft Error Mitigation Controller," Xilinx Inc., 2011.
[48] "Partial Reconfiguration User Guide ‐ UG702," 2011.
9 Appendices
9.1 Appendix A: Bitstream Scrubbing and Readback
Upsets in the configuration memory of Xilinx FPGAs can be removed by scrubbing. Scrubbing means reading back the
configuration bitstream that is stored in the configuration memory, comparing it with the original one and
correcting any affected configuration bits. The configuration management system, which detects and corrects
upsets in the configuration memory by means of scrubbing, can be hosted in a radiation‐hard FPGA, an ASIC, a
microcontroller or the FPGA itself. In the Virtex‐5 FPGA, internal scrubbing is done by reading back the frames
via the ICAP in conjunction with the Frame Error Correction Code (ECC) logic, which detects single‐ or double‐bit
errors in configuration frame data.
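To make the blind readback‐compare‐correct cycle concrete, the following Python sketch simulates one scrubbing pass on a toy configuration memory. It is a minimal illustration under stated assumptions, not the controller developed in this thesis: frames are modelled as lists of 41 32‐bit words (the Virtex‐5 frame length in Table 11), and plain list reads and writes stand in for ICAP frame readback and frame writes.

```python
from typing import List

Frame = List[int]  # one configuration frame: 41 words of 32 bits (Table 11)


def scrub(config_mem: List[Frame], golden: List[Frame]) -> List[int]:
    """Blind scrubbing pass: read back every frame, compare it with the golden
    copy and rewrite any frame that differs. Returns the repaired frame indices."""
    repaired = []
    for index, golden_frame in enumerate(golden):
        readback = list(config_mem[index])          # stands in for an ICAP frame readback
        if readback != golden_frame:
            config_mem[index] = list(golden_frame)  # stands in for an ICAP frame write
            repaired.append(index)
    return repaired


WORDS_PER_FRAME = 41
golden = [[(f * 7 + w) & 0xFFFFFFFF for w in range(WORDS_PER_FRAME)] for f in range(8)]
memory = [list(frame) for frame in golden]
memory[3][10] ^= 1 << 5      # simulated SEU: one flipped configuration bit
memory[6][0] ^= 1 << 31      # a second upset in another frame
print(scrub(memory, golden))  # [3, 6]
print(memory == golden)       # True
```

Every frame is read on every pass whether or not it is faulty, so the time until a given upset is repaired depends on where its frame lies in the scan order.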
Configuration management can only detect and correct errors caused by SEUs in the configuration memory; it cannot
mask their effects on the running design before the correction is made. Therefore, configuration management is
often combined with redundancy‐based FPGA mitigation schemes that mask the effects of SEUs in the system.
Virtex‐5 FPGA configuration memory is arranged in frames that are tiled across the device. These frames
are the smallest addressable segments of the Virtex‐5 configuration memory space, and all operations
must therefore act upon whole configuration frames. [31]
A frame (Figure 22) is the smallest part of the FPGA that can be reconfigured; in the Virtex‐5 a frame consists of
41 32‐bit words, i.e. 1,312 bits [40]. Frame counts and configuration sizes for the Virtex‐5 (LX110T) are shown in Table 11.
Table 11 Virtex‐5 Device Frame Count, Frame Length, Overhead, and Bitstream Size [31]
Device | Non‐Configuration Frames | Configuration Frames | Total Device Frames | Frame Length in Words | Configuration Array Size in Words | Bitstream Overhead in Words (1)
LX110T | 592 | 23,712 | 24,304 | 41 | 972,192 | 272
1‐ Configuration overhead consists of commands in the bitstream that are needed to perform configuration
but do not themselves program any memory cells. Configuration overhead contributes to the overall
bitstream size.
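The figures in Table 11 can be cross‐checked with a few lines of arithmetic. The Python sketch below derives the frame size in bits and the full bitstream size of the LX110T from the table values. The 32 × 2^20‑bit capacity used on the last line is our assumption for a 32 Mbit configuration PROM, included only to show that one full bitstream would fit in such a device.

```python
WORD_BITS = 32

# Values copied from Table 11 (XC5VLX110T)
non_config_frames = 592
config_frames = 23_712
frame_words = 41
overhead_words = 272

frame_bits = frame_words * WORD_BITS                 # bits in one frame
total_frames = non_config_frames + config_frames     # should match "Total Device Frames"
array_words = config_frames * frame_words            # should match "Configuration Array Size"
bitstream_bits = (array_words + overhead_words) * WORD_BITS
bitstream_bytes = bitstream_bits // 8

# Assumption: a 32 Mbit PROM holds 32 * 2**20 bits.
fits_in_32_mbit = bitstream_bits <= 32 * 2**20

print(frame_bits, total_frames, array_words)   # 1312 24304 972192
print(bitstream_bits, bitstream_bytes)         # 31118848 3889856
print(fits_in_32_mbit)                         # True
```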
Figure 22 A schematic FPGA structure. Taken from [8]
There are some advantages and disadvantages to this method. These include:
Advantages:
• There is no need to partition the device. Therefore, the performance remains unaffected.
• It can reconfigure a component at a finer granularity than the PR approach.
• The state of a component can be preserved.
• It needs less memory than the PR approach. Only one bitstream file is needed.
• All implementation options are available (e.g. optimization can be performed across the entire design).
Disadvantages:
• This method is blind, which means it reads back the configuration bitstream frame by frame and, if an upset
has occurred, fixes it. It cannot locate an error until it reads back the corresponding frame.
• The recovery process may take time (it depends on the fault location).
• It cannot mitigate the SEU's effect. Therefore, scrubbing cannot guarantee the correct operation of the design
after recovery.
• It is not possible to detect SEUs in the BRAMs and the registers, since their values change over time and
cannot be tested by scrubbing. Detection of errors in these parts of the system should be done by means of
other techniques. [12]
• This method can only reconfigure with the same bitstream. It cannot change the functionality of the
components, so time multiplexing is not possible.
• It is difficult to monitor whether an error has occurred during the reconfiguration of a frame.
• It is not possible to detect more than two SEUs or to correct more than one SEU in a frame.
Xilinx has recently introduced an SEU mitigation controller (SEM), which is based on bitstream scrubbing.
The SEM Controller implements five main functions: initialization, error injection, error detection, error
correction, and error classification. All functions, except initialization and detection, are optional; the
desired functions are selected during the IP core configuration and generation process in the Xilinx CORE
Generator. [41]
The controller initializes by bringing the integrated soft error detection capability of the FPGA into a
known state after the FPGA enters user mode. After this initialization, the controller endlessly loops,
observing the integrated soft error detection status. When an ECC or CRC error is detected, the
controller evaluates the situation to identify the Configuration Memory location involved. [41]
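The initialize‐then‐observe behaviour described above can be sketched as a small state trace. This is a hypothetical Python model, not the LogiCORE interface: the state names, the `StatusSample` record and the `on_error` callback are our own illustrative assumptions, and a finite list of status samples stands in for the controller's endless loop.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Iterable, List, Optional


class State(Enum):            # hypothetical state names
    INITIALIZATION = auto()
    OBSERVATION = auto()
    EVALUATION = auto()


@dataclass
class StatusSample:
    ecc_error: bool = False
    crc_error: bool = False
    frame_address: Optional[int] = None   # None when no location is reported


def run_controller(samples: Iterable[StatusSample],
                   on_error: Callable[[Optional[int]], None]) -> List[State]:
    """Initialize once, then observe each status sample; on an ECC or CRC error,
    evaluate it by handing the reported location to `on_error`."""
    trace = [State.INITIALIZATION]        # bring error detection into a known state
    for sample in samples:
        trace.append(State.OBSERVATION)
        if sample.ecc_error or sample.crc_error:
            trace.append(State.EVALUATION)
            on_error(sample.frame_address)
    return trace


located: List[Optional[int]] = []
trace = run_controller([StatusSample(), StatusSample(ecc_error=True, frame_address=42)],
                       located.append)
print([s.name for s in trace])  # ['INITIALIZATION', 'OBSERVATION', 'OBSERVATION', 'EVALUATION']
print(located)                  # [42]
```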
Once this is complete, the controller may optionally correct the soft error by repairing it or by replacing
the affected bits. The repair methods use active partial reconfiguration to perform a localized correction
of Configuration Memory with a read‐modify‐write scheme; they rely on algorithms to identify the error in need
of correction. The replace method also uses active partial reconfiguration with the same goal, but with a
write‐only scheme that overwrites Configuration Memory with the original data. This data is provided by the
implementation tools and stored outside the controller. [41]
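The difference between the two correction styles can be shown on a single frame. In this hedged Python sketch, the erroneous bit position (word, bit) is assumed to be already known (for example from an ECC syndrome), and `golden` stands in for the original data kept outside the controller; the function names are ours, not those of the SEM core.

```python
from typing import List

Frame = List[int]


def repair(readback: Frame, word: int, bit: int) -> Frame:
    """Read-modify-write: start from the frame read back from configuration
    memory, flip only the bit identified as erroneous, and return the frame to
    write back."""
    corrected = list(readback)
    corrected[word] ^= 1 << bit
    return corrected


def replace(original: Frame) -> Frame:
    """Write-only: the frame to write is simply the original data; no readback
    or error location is needed."""
    return list(original)


golden = [0x0000FFFF] * 41
upset = list(golden)
upset[7] ^= 1 << 3                       # simulated single-bit upset
print(repair(upset, 7, 3) == golden)     # True
print(replace(golden) == golden)         # True
```

Repair needs only the located bit but depends on a readback; replace needs no error location but requires the original frame data to be stored somewhere.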
The controller may optionally classify the soft error as essential or non‐essential using a lookup table.
The lookup table, which is also provided by the implementation tools, is stored outside the controller and is
fetched as required during error classification.
[41]
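A lookup‐based classification reduces to a membership test. In the Python sketch below, the table is assumed to be a set of (frame, word, bit) addresses that the tools marked as essential; this representation and the sample addresses are illustrative only, and the real SEM data format is not reproduced.

```python
from typing import Set, Tuple

Address = Tuple[int, int, int]   # (frame index, word index, bit index), hypothetical layout


def classify(error: Address, essential_bits: Set[Address]) -> str:
    """Label an upset as essential if its address appears in the lookup table."""
    return "essential" if error in essential_bits else "non-essential"


essential = {(120, 4, 17), (120, 4, 18), (9001, 0, 2)}
print(classify((120, 4, 17), essential))   # essential
print(classify((5, 40, 31), essential))    # non-essential
```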
When the controller is idle, there is an option to accept input from the user to inject errors into
Configuration Memory. This function is useful for testing the integration of the controller into a larger
system design. Using the error injection capability, system verification and validation engineers may
construct test cases to ensure the complete system responds to soft error events as expected. [41]
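The test pattern enabled by error injection (inject while idle, then check that detection and correction restore the design) can be sketched in Python. Here a per‐frame parity bit is a simplified, assumed stand‐in for the FPGA's ECC/CRC detection, and correction is done by replacement from a golden copy.

```python
import random
from typing import List


def parity(frame: List[int]) -> int:
    """Parity of all bits in a frame (stand-in for the on-chip detection logic)."""
    p = 0
    for word in frame:
        p ^= bin(word).count("1") & 1
    return p


def inject(mem: List[List[int]], frame: int, word: int, bit: int) -> None:
    mem[frame][word] ^= 1 << bit


def detect(mem: List[List[int]], golden_parity: List[int]) -> List[int]:
    return [i for i, f in enumerate(mem) if parity(f) != golden_parity[i]]


rng = random.Random(1)
golden = [[rng.getrandbits(32) for _ in range(41)] for _ in range(16)]
golden_parity = [parity(f) for f in golden]
mem = [list(f) for f in golden]

inject(mem, 5, 12, 9)                 # test case: a single upset in frame 5
flagged = detect(mem, golden_parity)
for i in flagged:
    mem[i] = list(golden[i])          # correction by replacement
print(flagged, mem == golden)         # [5] True
```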
The SEM controller uses the ICAP for readback and for accessing the configuration memory. The ICAP interface
is a point‐to‐point connection between the SEM Controller and the ICAP primitive. The ICAP primitive
enables read and write access to the registers inside the FPGA configuration system. For error detection,
the SEM controller uses the FRAME_ECC interface. The FRAME_ECC primitive is an output‐only primitive
that provides a window into the soft error detection function in the FPGA configuration system. The
Virtex‐5 frame error correction code (ECC) logic is designed to detect single‐ or double‐bit errors in
configuration frame data. [41] [31]
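The frame ECC is a single‐error‐correcting, double‐error‐detecting (SECDED) check. As a hedged illustration of how such a check distinguishes the cases, the Python sketch below implements a textbook extended Hamming code over a list of bits. It is not the exact bit layout or syndrome format of the Virtex‐5 FRAME_ECC logic.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Extended Hamming encoding. Index 0 holds the overall parity bit,
    power-of-two indices hold Hamming parity bits, the rest hold data bits."""
    n_parity = 0
    while (1 << n_parity) < len(data) + n_parity + 1:
        n_parity += 1
    code = [0] * (len(data) + n_parity + 1)
    bits = iter(data)
    for pos in range(1, len(code)):
        if pos & (pos - 1):                      # not a power of two: data position
            code[pos] = next(bits)
    for p in range(n_parity):
        mask = 1 << p
        code[mask] = sum(code[i] for i in range(1, len(code)) if i & mask) % 2
    code[0] = sum(code) % 2                      # makes the whole codeword even parity
    return code


def decode(received: List[int]) -> Tuple[str, List[int]]:
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(received)
    syndrome = 0
    for i in range(1, len(code)):
        if code[i]:
            syndrome ^= i                        # XOR of the positions of all set bits
    overall = sum(code) % 2                      # 1 if an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < len(code):  # single error; syndrome 0 means bit 0
        code[syndrome] ^= 1
        status = "corrected"
    else:                                        # detected but not correctable, e.g. two flips
        status = "uncorrectable"
    data = [code[i] for i in range(1, len(code)) if i & (i - 1)]
    return status, data


word = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
cw = encode(word)
one_flip = list(cw); one_flip[6] ^= 1
two_flips = list(cw); two_flips[3] ^= 1; two_flips[9] ^= 1
print(decode(one_flip)[0], decode(two_flips)[0])   # corrected uncorrectable
```

For a 1,312‐bit frame, this particular construction needs 11 Hamming parity bits plus one overall parity bit; whether the device uses the same layout should be checked against [31].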
There are some advantages and disadvantages to this method. These include:
Advantages:
• Supports different error detection and correction techniques [41]
• Flexible; it can be used in many applications [41]
• Can perform error detection, error containment, error classification, and error correction
• Provides various status and monitor registers
Disadvantages:
• Only available for specific Xilinx families (Spartan‐6, Virtex‐6, Virtex‐7, and Kintex‐7)
• Error detection is not optimal (it uses the FRAME_ECC primitive, which reads the configuration memory
frame by frame periodically)
9.2 Appendix B: Redundancy
One method to provide fault tolerance in embedded systems is through redundancy. For our purposes,
“redundancy is the provision of functional capabilities that would be unnecessary in a fault free
environment. This can be a replicated hardware component, an additional check bit attached to a string
of digital data, or a few lines of program code verifying the correctness of the program’s results.” [42]
“Two kinds of redundancy are possible: space redundancy and time redundancy. Space redundancy
provides additional components, functions, or data items that are unnecessary for a fault‐free
operation. Space redundancy is further classified into hardware, software and information redundancy,
depending on the type of redundant resources added to the system. In time redundancy the
computation or data transmission is repeated and the result is compared to a stored copy of the
previous result.” [42]
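Time redundancy as defined above can be sketched in a few lines of Python: the same computation is repeated and each result is compared with the stored result of the previous run. The `compute` callable is a hypothetical stand‐in for any operation that a transient fault might disturb. A scheme like this can catch transient upsets, but not a permanent fault that corrupts every run in the same way.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(compute: Callable[[], T], repetitions: int = 2) -> T:
    """Run `compute` several times and compare each result with the stored
    result of the previous run; a mismatch signals a (transient) fault."""
    if repetitions < 2:
        raise ValueError("time redundancy needs at least two runs")
    previous = compute()
    for _ in range(repetitions - 1):
        current = compute()
        if current != previous:
            raise RuntimeError("results of repeated runs differ")
        previous = current
    return previous


print(run_with_time_redundancy(lambda: 6 * 7))  # 42
```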
In the literature, the term redundancy mostly refers to space redundancy. The most common form of
space redundancy is Triple Modular Redundancy (TMR). Figure 23 shows the basic principle of TMR. In
TMR, the component is triplicated and the outputs of the three copies are compared by a voter. If there is
an error in one module, the voter masks the error. TMR can be applied at different granularities, from the
logic level to the system level.
Figure 23 TMR basic principle
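The voter in Figure 23 can be modelled as a bitwise 2‐out‐of‐3 majority function. The Python sketch below is a minimal illustration of that principle; it also reports which replica disagrees with the voted value, which is the information a reconfiguration controller would need.

```python
from typing import List


def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority: each output bit is the value held by at
    least two replicas, so an error in any single replica is masked."""
    return (a & b) | (a & c) | (b & c)


def disagreeing(a: int, b: int, c: int) -> List[int]:
    """Indices of the replicas whose output differs from the voted value."""
    voted = vote(a, b, c)
    return [i for i, x in enumerate((a, b, c)) if x != voted]


print(hex(vote(0xF0, 0xF0, 0x0F)))    # 0xf0
print(disagreeing(0xF0, 0xF0, 0x0F))  # [2]
```

Because voting is done per bit, faults in different replicas are still masked as long as no two replicas are wrong in the same bit position; for example, the replicas 0b101, 0b011 and 0b110 vote to 0b111.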
In addition to TMR, there are many other hardware redundancy techniques available (such as N‐modular
redundancy, duplication with comparison, standby sparing, self‐purging redundancy and triplex‐duplex
redundancy [42]). Xilinx has introduced the XTMR30 software tool to simplify the task of design triplication.
According to [39], “TMRtool can partially or fully triplicate a design, insert voters, synchronize feedback
path loops, and allow customized user‐triplicated module insertion. A triplicated design mitigates SEU
impact on the user design.” However, XTMR is very costly in terms of resource utilization and, as a
result, leads to a lower operating frequency and higher power consumption.
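Two of the alternatives listed above, N‐modular redundancy and duplication with comparison, can be contrasted in a short Python sketch. Voting here is done on whole output words rather than per bit, which is a simplifying choice of ours; the point is that N copies can mask faults by strict majority, whereas two copies can only flag a mismatch.

```python
from collections import Counter
from typing import Optional, Sequence


def nmr_vote(outputs: Sequence[int]) -> Optional[int]:
    """Word-level N-modular redundancy: return the value produced by a strict
    majority of the N replicas, or None when no such majority exists."""
    if not outputs:
        raise ValueError("need at least one replica output")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def duplicate_and_compare(a: int, b: int) -> bool:
    """Duplication with comparison: True if both copies agree. A mismatch is
    detected, but with only two copies it cannot tell which one is faulty."""
    return a == b


print(nmr_vote([5, 5, 7, 5, 9]))     # 5
print(nmr_vote([1, 2, 3]))           # None
print(duplicate_and_compare(4, 6))   # False
```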
30 Xilinx Triple Modular Redundancy
Redundancy can also be applied at the device level. For instance, in Figure 24 the FPGA is triplicated
(the original device plus two identical copies). However, in this design the voter is itself a single point of
failure and must be implemented on a radiation‐hardened device. This design can also be very expensive.
Figure 24 TMR ‐ Device Level
To summarize, redundancy is a common technique in almost all approaches to FT design; however, it
cannot be considered a complete mitigation scheme for an FPGA design on its own, because redundancy can only
detect and mask faults and cannot recover the modules from them. Therefore, the best mitigation
schemes combine redundancy with a reconfiguration controller for detecting, masking
and correcting the faults. Table 12 summarizes the performance of the mitigation schemes.
Table 12 Performance Overview of mitigation schemes. Part of the table is taken from [12]
Mitigation Scheme | Mitigation Strength | Board Layout Complexity | Ease in Meeting Timing Constraints | Power Consumption | Component Cost | Average Recovery Speed
Power cycling | Weak | Low | Normal | Typical | Low | Lowest
XTMR | Medium | High | Reduced | ~3X typical | Low | N/A
Bitstream Scrubbing | Medium | Low | Normal | Typical | Medium | Low
PDR | Medium | Low | Reduced | Typical | Low | High
XTMR + Bitstream Scrubbing | Strong | High | Reduced | ~3X typical | Medium | Low
XTMR + PDR | Strong | High | Reduced | ~3X typical | Low | High
Redundant devices + Bitstream Scrubbing | Strongest | Medium | Normal | 2~4X typical | High | Low
Redundant devices + PDR | Strong | Medium | Reduced | 2~4X typical | High | Highest
As previously mentioned, a combination of redundancy and a reconfiguration controller yields the
strongest mitigation scheme. For instance, redundant devices (which could combine redundancy at the
component and device levels in multi‐FPGA platforms) plus bitstream scrubbing or PDR lead to the best
mitigation results. An important difference between scrubbing and PDR is that scrubbing may be more
effective in recovering upsets, especially when an SEU occurs in the routing bits; however, its recovery
speed can be much lower than that of PDR.
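Why redundancy alone is not enough can be illustrated with a small Python simulation: a triplicated value, a majority voter, and an optional repair step that restores any replica disagreeing with the vote from a golden value, standing in for a scrubbing or PDR pass between upsets. The upset pattern and golden value are arbitrary assumptions chosen for the demonstration.

```python
from typing import List, Tuple


def vote(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)


def simulate(upsets: List[Tuple[int, int]], repair: bool, golden: int = 0xA5) -> List[int]:
    """Apply upsets (replica index, bit) one at a time to a triplicated value and
    record the voted output after each. With repair=True, any replica that
    disagrees with the vote is restored from the golden value."""
    replicas = [golden, golden, golden]
    outputs = []
    for replica, bit in upsets:
        replicas[replica] ^= 1 << bit
        voted = vote(*replicas)
        outputs.append(voted)
        if repair:
            replicas = [golden if r != voted else r for r in replicas]
    return outputs


upsets = [(0, 1), (1, 1)]   # the same bit is hit in two different replicas, one after another
print([hex(v) for v in simulate(upsets, repair=False)])  # ['0xa5', '0xa7']
print([hex(v) for v in simulate(upsets, repair=True)])   # ['0xa5', '0xa5']
```

Without repair, the first upset is masked but stays latent, so the second one defeats the voter. With repair between upsets, the voter only ever sees one faulty replica at a time.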
9.3 Appendix C: Xilinx Virtex‐5 Overview
Xilinx Virtex‐5 FPGAs, introduced by Xilinx in 2006, are one of the Virtex families and offer performance of
up to 550 MHz. This family is divided into four different categories:
1‐ Virtex‐5 LX: High‐performance general logic applications
2‐ Virtex‐5 LXT: High‐performance logic with advanced serial connectivity
3‐ Virtex‐5 SXT: Signal processing applications with advanced serial connectivity
4‐ Virtex‐5 FXT: Embedded systems with advanced serial connectivity
Table 13 Virtex‐5 (LX110T) device specification. Taken from [43]
Device | Configurable Logic Blocks (CLBs): Array (Row x Col) | CLB Slices (1) | CLB Max Distributed RAM (kb) | DSP48E Slices (2) | Block RAM Blocks 18 kb (3) | Block RAM Blocks 36 kb | Block RAM Max (kb) | CMTs (4) | PowerPC Processor Blocks | Ethernet MACs | Max RocketIO Transceivers GTP (5) | Max RocketIO Transceivers GTX (5) | Max User I/O (6)
XC5VLX110T | 160 x 54 | 17,280 | 1,120 | 64 | 296 | 148 | 5,328 | 6 | N/A | 4 | 16 | N/A | 680
1‐ Virtex‐5 FPGA slices are organized differently from previous generations. Each Virtex‐5 FPGA slice contains
four 6‐input LUTs and four flip‐flops (previously it was two LUTs and two flip‐flops).
2‐ Each DSP48E slice contains a 25 x 18 multiplier, an adder, and an accumulator.
3‐ Block RAMs are fundamentally 36 Kbits in size. Each block can also be used as two independent 18‐Kbit blocks.
4‐ Each Clock Management Tile (CMT) contains two DCMs and one PLL.
5‐ RocketIO GTP transceivers are designed to run from 100 Mb/s to 3.75 Gb/s. RocketIO GTX transceivers are
designed to run from 150 Mb/s to 6.5 Gb/s.
6‐ This number does not include RocketIO transceivers.
Figure 25 Xilinx Virtex‐5 XC5VLX110T device. Taken from [44]
The proposed controller in this thesis is implemented on a Xilinx Virtex‐5 XC5VLX110T FPGA (Figure 25).
Table 13 shows the device specification. Like all other Xilinx FPGA series, Virtex‐5 families store their
configuration bitstream in SRAM‐type internal latches. Virtex‐5 devices are configured by loading
application‐specific configuration data into internal memory via the configuration interface. This data
contains bits that set the configuration of each LUT and flip‐flop as well as all routing connections.
Moreover, the bitstream contains all necessary data for configuring embedded elements such as the
PowerPC and the ICAP, as well as the initial contents of the BRAMs [45]. Because Xilinx configuration memory is
volatile, it must be reconfigured each time the device is powered up. The Virtex‐5 FPGA can be configured via
several configuration interfaces, which are listed in Table 14.
Table 14 Virtex‐5 Configuration Modes
No. | Configuration Mode | Type of Interface | Bus Width (bits)
1 | Master‐serial configuration mode | Serial | 1
2 | Slave‐serial configuration mode | Serial | 1
3 | Master SelectMAP configuration mode | Parallel | 8 or 16
4 | Slave SelectMAP configuration mode | Parallel | 8, 16, or 32
5 | JTAG/Boundary‐Scan configuration mode | Serial | 1
6 | Master Serial Peripheral Interface (SPI) Flash configuration mode | Serial | 1
7 | Master Byte Peripheral Interface Up (BPI‐Up) Flash configuration mode | Parallel | 8 or 16
8 | Master Byte Peripheral Interface Down (BPI‐Down) Flash configuration mode | Parallel | 8 or 16
Among these interfaces, we have used the Master‐serial configuration mode for full FPGA configuration. For
partial reconfiguration, we have used the Internal Configuration Access Port (ICAP), which is based on the
SelectMAP protocol.
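Partial reconfiguration through the ICAP amounts to streaming the words of a partial bitstream into the port. The Python sketch below shows that flow under explicit assumptions: the bitstream is a raw binary byte string, words are big‐endian, and `icap_write_word` is a hypothetical driver callback. The sync word 0xAA995566 is the standard Virtex‐series value; any device‐specific bit ordering that the ICAP requires is deliberately left out and should be taken from [31].

```python
from typing import Callable, List

SYNC_WORD = 0xAA995566  # sync word that marks the start of configuration packets


def to_words(data: bytes) -> List[int]:
    """Split a raw bitstream into big-endian 32-bit words."""
    if len(data) % 4:
        raise ValueError("bitstream length must be a multiple of 4 bytes")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


def send_partial_bitstream(data: bytes, icap_write_word: Callable[[int], None]) -> int:
    """Push every word of a partial bitstream through a (hypothetical) ICAP
    write callback and return the number of words written. Rejects data in
    which the word-aligned sync word never appears."""
    words = to_words(data)
    if SYNC_WORD not in words:
        raise ValueError("sync word not found: not a raw configuration bitstream?")
    for word in words:
        icap_write_word(word)
    return len(words)


written: List[int] = []
count = send_partial_bitstream(bytes.fromhex("FFFFFFFF" "AA995566" "20000000"), written.append)
print(count, [hex(w) for w in written])  # 3 ['0xffffffff', '0xaa995566', '0x20000000']
```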
The XUPV5‐LX110T is a feature‐rich, general‐purpose evaluation and development platform with on‐board
memory and industry‐standard connectivity interfaces. It features the Virtex‐5 XC5VLX110T device [46].
The evaluation platform (Figure 26) has the following features:
• Xilinx Virtex‐5 XC5VLX110T FPGA
• Two Xilinx XCF32P Platform Flash PROMs (32 Mbit each) for storing large device configurations
• Xilinx System ACE Compact Flash configuration controller
• 64‐bit wide 256 MB DDR2 small outline DIMM (SODIMM) module compatible with EDK‐supported IP and
software drivers
• On‐board 32‐bit ZBT synchronous SRAM and Intel P30 StrataFlash
• 10/100/1000 tri‐speed Ethernet PHY supporting MII, GMII, RGMII, and SGMII interfaces
• USB host and peripheral controllers
• Programmable system clock generator
• Stereo AC97 codec with line in, line out, headphone, microphone, and SPDIF digital audio jacks
• RS‐232 port, 16x2 character LCD, and many other I/O devices and ports
In this thesis, two identical XUPV5‐LX110T evaluation platforms have been used. One plays the role of a
master that monitors and, in case of failure, recovers the second one, which plays the role of the slave. We
have utilized two Xilinx XCF32P Platform Flash PROMs for storing the full configuration bitstreams. The
partial bitstream files were stored in on‐chip BRAMs and in an off‐board Atmel I2C memory; however,
on‐board memories (such as the Compact Flash, ZBT synchronous SRAM or SPI flash) can be used as an
alternative for storing partial or full bitstream files. In addition, we have used on‐board DIP switches,
LEDs and keys for testing the system. Moreover, the expansion I/Os have been used for connecting the two
evaluation boards to each other to form a master‐slave system.
Figure 26 Xilinx XUPV5‐LX110T Evaluation Platform. Taken from [46]
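One common way for a master to detect that a slave has failed is a heartbeat watchdog. The Python sketch below is a hypothetical illustration of that idea only, not the mechanism implemented in this thesis: it assumes the slave toggles a heartbeat line on the expansion I/Os, and that the master calls a `recover` action (for example, reloading the slave's bitstream) when no toggle is seen within a timeout. The signal, timeout and recovery action are all assumptions.

```python
from typing import Callable, List


class HeartbeatMonitor:
    """Hypothetical master-side watchdog for a toggling slave heartbeat."""

    def __init__(self, timeout_ticks: int, recover: Callable[[], None]) -> None:
        self.timeout_ticks = timeout_ticks
        self.recover = recover
        self.last_level = 0
        self.silent_ticks = 0

    def tick(self, heartbeat_level: int) -> None:
        """Call once per master clock tick with the sampled heartbeat level."""
        if heartbeat_level != self.last_level:      # edge seen: slave is alive
            self.silent_ticks = 0
            self.last_level = heartbeat_level
            return
        self.silent_ticks += 1
        if self.silent_ticks >= self.timeout_ticks:
            self.recover()
            self.silent_ticks = 0                    # give the slave time to restart


recoveries: List[int] = []
mon = HeartbeatMonitor(timeout_ticks=3, recover=lambda: recoveries.append(1))
for level in [1, 0, 1, 1, 1, 1]:     # the slave stops toggling after the third sample
    mon.tick(level)
print(len(recoveries))   # 1
```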
9.4 Appendix D: Configuration Modes in Virtex‐5
9.4.1 Configuration Modes and Pins in Virtex‐5 [31]
Virtex®‐5 devices are configured by loading application‐specific configuration data (the bitstream) into
internal memory. Because Xilinx FPGA configuration memory is volatile, it must be configured each time
it is powered up. The bitstream is loaded into the device through special configuration pins. These
configuration pins serve as the interface for a number of different configuration modes:
• Master‐serial configuration mode
• Slave‐serial configuration mode
• Master SelectMAP (parallel) configuration mode (x8 and x16 only)
• Slave SelectMAP (parallel) configuration mode (x8, x16, and x32)
• JTAG/Boundary‐Scan configuration mode
• Master Serial Peripheral Interface (SPI) Flash configuration mode
• Master Byte Peripheral Interface Up (BPI‐Up) Flash configuration mode (x8 and x16 only)
• Master Byte Peripheral Interface Down (BPI‐Down) Flash configuration mode (x8 and x16 only)
9.4.2 Serial Configuration Interface [31]
In serial configuration modes, the FPGA is configured by loading one configuration bit per CCLK cycle:
• In Master Serial mode, CCLK is an output.
• In Slave Serial mode, CCLK is an input.
Figure 27 shows the basic Virtex‐5 serial configuration interface. There are four methods of configuring
an FPGA in serial mode:
• Master serial configuration
• Slave serial configuration
• Serial daisy‐chain configuration
• Ganged serial configuration
Figure 27 Virtex‐5 FPGA Serial Configuration Interface. Taken from [31]
Table 15 Virtex‐5 FPGA Serial Configuration Interface Pins
Pin Name | Type | Dedicated or Dual Purpose | Description
M[2:0] | Input | Dedicated | Mode pins: determine the configuration mode. They can be set via special DIP switches on the evaluation board.
CCLK | Output | Dedicated | Configuration clock source for all configuration modes except JTAG. For Master Serial it is an output.
D_IN | Input | Dedicated | Serial configuration data input, synchronous to the rising CCLK edge.
DOUT_BUSY | Output | Dedicated | Serial data output for downstream daisy‐chained devices. It is left unconnected in our design.
DONE | Bidirectional, Open‐Drain, or Active | Dedicated | Active‐High signal indicating that configuration is complete: 0 = FPGA not configured, 1 = FPGA configured. Refer to the BitGen section of the Development System Reference Guide for software settings.
INIT_B | Input or Output, Open‐Drain | Dedicated | Before the Mode pins are sampled, INIT_B is an input that can be held Low to delay configuration. After the Mode pins are sampled, INIT_B is an open‐drain, active‐Low output indicating whether a CRC error occurred during configuration: 0 = CRC error, 1 = No CRC error.
PROGRAM_B | Input | Dedicated | Active‐Low asynchronous full‐chip reset.
Figure 28 shows how configuration data is clocked into Virtex‐5 devices in Master Serial mode.
Figure 28 Serial Configuration Clocking Sequence. Taken from [31]
Notes relevant to Figure 28:
1. Bit 0 represents the MSB of the first byte. For example, if the first byte is 0xAA (1010_1010), bit
0 = 1, bit 1 = 0, bit 2 = 1, etc.
2. For Master configuration mode, CCLK does not transition until after the Mode pins are sampled,
as indicated by the arrow.
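Note 1 fixes the order in which bits leave each byte on D_IN: most significant bit first, one bit per CCLK cycle. A minimal Python sketch of that ordering (the function name is ours):

```python
from typing import Iterator


def serial_bits(data: bytes) -> Iterator[int]:
    """Yield configuration bits in the order they appear on D_IN (MSB first)."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1


print(list(serial_bits(b"\xAA")))   # [1, 0, 1, 0, 1, 0, 1, 0]
```

Since one bit is loaded per CCLK cycle, a bitstream of n bytes needs 8n CCLK cycles in serial mode.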
3. CCLK can be free running in Slave Serial mode.
The Master Serial mode is designed so that the FPGA can be configured from a Xilinx configuration
PROM, as shown in Figure 29.
Figure 29 Master Serial Mode Configuration. Taken from [31]
Notes relevant to Figure 29:
1. The DONE pin is by default an open‐drain output requiring an external pull‐up resistor. The
DONE pin has a programmable active driver. To enable it, enable the DriveDONE option in
BitGen.
2. The INIT_B pin is a bidirectional, open‐drain pin. An external pull‐up resistor is required.
3. The BitGen startup clock setting must be set to CCLK for serial configuration.
4. The PROM in this diagram represents one or more Xilinx PROMs. Multiple Xilinx PROMs can be
cascaded to increase the overall configuration storage capacity.
5. The BIT file must be reformatted into a PROM file before it can be stored on the Xilinx PROM.
6. On some Xilinx PROMs, the reset polarity is programmable. RESET should be configured as
active Low when using this setup.
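The DONE and INIT_B semantics from Table 15 can be captured in a small status decoder, for instance in a host that watches the configuration of the FPGA. This Python sketch assumes the host can sample both pin levels and knows whether the mode pins have already been sampled; how that is done is outside the sketch, and the status strings are our own labels.

```python
def configuration_status(done: int, init_b: int, mode_pins_sampled: bool) -> str:
    """Interpret DONE and INIT_B levels as described in Table 15."""
    if not mode_pins_sampled:
        return "initializing"        # INIT_B is still an input that may delay configuration
    if init_b == 0:
        return "crc-error"           # active-Low INIT_B output: CRC error during configuration
    return "configured" if done == 1 else "loading"


print(configuration_status(done=1, init_b=1, mode_pins_sampled=True))   # configured
print(configuration_status(done=0, init_b=0, mode_pins_sampled=True))   # crc-error
```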