
MODELING AND MITIGATION FOR HYBRID SPACE COMPUTERS

By

CHRISTOPHER MARK WILSON

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

© 2018 Christopher Mark Wilson

To my family for all their support, and especially Dave for not letting me make bad decisions.


ACKNOWLEDGMENTS

This research was funded by industry and government members of the National Science

Foundation (NSF) Center for Space, High-performance and Resilient Computing (SHREC),

formerly known as the Center for High-performance Reconfigurable Computing (CHREC) and

its I/UCRC Program under Grant Nos. IIP-1161022 and CNS-1738783. The author thanks Alan George, advisor and co-author, for all of the dissertation research.

For extensive contributions to the success of the STP-H5/CSP mission, the author wishes to especially thank Dylan Rudolph and Jacob Stewart (PCB design), Patrick Gauvin (flight

software), James MacKinnon (instrument design), Antony Gillette (ground-station development),

Darlene Brown (procurement), Alex Wilson, Aaron Stoddard, and Dr. Mike Wirthlin (scrubbing

and radiation testing), Gary Crum (systems engineering support) and Tom Flatley (mission

design).

More than a dozen students contributed heavily to the hardware and software

development of CSPv1 and to the STP-H5/CSP mission. Special thanks to James Coole, Ed

Carlisle, Bryant Cockcroft, Sebastian Sabogal, Daniel Sabogal, Jonathan Urriste, Dorothy Wong,

Brad Shea, Christopher Morales, Andy Wilson, Jordan Anderson, Ryan Zavoral, Rainer

Ledesma, Travis Wise, Jay Wang, Joe Kimbrell, and Dr. Herman Lam for their contributions.

We also gratefully acknowledge all the additional support received from NASA Goddard for software

development, environmental testing, mechanical design, and design reviews. This group includes

Elizabeth Timmons, Jaclyn Beck, Alessandro Geist, Keven Ballou, Dave Petrick, Mike Lin,

Allison Evans, Matt Colvin, Eric Gorman, Tracy Price, Curtis Dunsmore, and Katie Brown. We

would also like to acknowledge the operations support provided by STP, specifically Robert Plunkett,

Zachary Tejral, and William Lopez. Finally, we’d like to thank Brandon Reddell and Kyson


Nguyen of NASA JSC’s EV511 group for supporting the expensive heavy-ion radiation tests at

BNL.

For assistance in the development of the hybrid modeling methodology, the author

wishes to thank Ben Klamm, Jacob Stewart, Ed Carlisle, and Pete Sparacino for their expertise

and input toward developing this methodology. In addition, we would like to thank Nick Wulf,

Dr. Ann Gordon-Ross, Dr. Michael Wirthlin, Alex Wilson, Dan Espinosa, and Dave Petrick for

their support and review. Finally, we would like to thank Mike Campola, Ray Ladbury, and Ken

LaBel of NASA Goddard Code 561 for input, feedback, and review.

For assistance in development of the new hybrid, fault-tolerant framework, the author

wishes to thank Sebastian Sabogal for extensive FPGA development and extending the work in

future papers. Additionally, the author thanks Jason Gutel for preliminary AMP development,

Adam Jacobs for guidance and knowledge related to RFT, Ed Carlisle for assistance in simple

verification experiments, David Wilson for initial prototype studies, and Tyler Lovelly and Andy

Milluzzi for providing device metrics for the Zynq and MicroBlaze. The author thanks John

McDougall at Xilinx for providing BSPs for AMP.

The author also gratefully acknowledges donations and support from the following

vendors and organizations that helped make this work possible: Xilinx for development licenses

and web ticket support; Intersil, Texas Instruments, Microsemi Corporation, Cobham, and e2V

for supplying key components that comprise the designs; and Department of Energy and Cisco

for supporting the LANSCE and TRIUMF radiation tests, respectively.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND RESEARCH
   Space-Radiation Environment
   Radiation Mitigation Programs and Processes
      Radiation Hardness Assurance (RHA)
      Single-Event Effects Criticality Analysis (SEECA)
      NASA Electronic Parts and Packaging (NEPP) Program
      Example NASA CubeSat Part Selection Process
   Reliability Modeling
      Probabilistic Risk Assessment and Fault-Tree Analysis
      Dynamic Computer Fault Tree and Markov Models
   Types of Computing
      Reconfigurable Computing
      Hybrid Computing
   Fault-Tolerant Strategies
      Symmetric and Asymmetric Multiprocessing (SMP / AMP)
      Lockstep Operation
      Reconfigurable Fault Tolerance (RFT)
      Radiation Tolerant SmallSat (RadSat) Computer System
   Space Test Program Houston-5 (STP-H5)
      Space Test Program
      ISS SpaceCube Experiment Mini (ISEM)

3 SMALL SPACECRAFT COMPUTING
   SmallSats and CubeSats Overview
   SmallSat Technology State of the Art
   SmallSat Computing vs. Traditional Spacecraft Computing
   Challenges to SmallSat Computing
   Better Computing with Hybrid Approach

4 CONCEPTS OF HYBRID, RECONFIGURABLE SPACE COMPUTING

5 RELIABILITY METHODOLOGY FOR SMALLSAT COMPUTERS
   Methodology Stages
      Stage 1: Component Analysis
      Stage 2: Radiation Data Collection
      Stage 3: Mission and Model Parameter Entry
      Stage 4: Fault-Tree Construction, Iteration, and Modification
      Mitigation Guidelines
   CSPv1 Analysis
      Case Study: Description and Assumptions
      Case Study: Results and Analysis
   Methodology Insights and Improvements

6 CSPv1 DESIGN
   Hardware Architecture
   Software Design
   Fault-Tolerant Computing Options
   Design Revisions

7 PERFORMANCE ANALYSIS OF CSPv1

8 RELIABILITY ANALYSIS OF CSPv1
   Radiation Testing Results
      Neutron Testing
      Brookhaven National Laboratory October 2015 Radiation Test
      Brookhaven National Laboratory October 2016 Radiation Test
   Radiation Environment Upset Prediction
   Workmanship Reliability

9 HIGHLIGHTS OF STP-H5/CSP MISSION EXPERIMENT
   Mission Configuration
      Hardware
      Software
      Ground Station
   Primary Mission Objectives
   Secondary Mission Objectives
      Autonomous Operations
      In-Situ Upload Capability
      Partial Reconfiguration
      Space Middleware
      Device Virtualization and Dynamic Synthesis
   Preliminary On-Orbit Results

10 FAULT-TOLERANT FRAMEWORK FOR HYBRID DEVICES
   HARFT Use-Case and Design Overview
      Flight Example
      HARFT Hardware Architecture
         Hard-Processing System (HPS)
         Soft-Processing System (SPS)
      Configuration Manager (ConfigMan)
         ConfigMan Scrubbing
         ConfigMan mode-switching mechanics
         ConfigMan mode-switching process
      SPS Static Logic
         Fault-tolerant mode switching
         Mode switching
         Challenges
         Flight configuration and use model
   Experiments and Results
      Processor Experiments
         Basic SMP experiment
         Basic AMP experiment
      Reliability Modeling
         CRÈME96
         Modeling methodology
      HARFT Prototype Description
         HPS configuration
         SPS configuration
         ConfigMan configuration
         Additional hardware configuration
      HARFT Prototype Analysis
      HARFT Performance Modeling
   Framework Status and Future Considerations

11 CSP SUCCESSORS
   µCSP and Smart Modules
      Concepts of Smart Modules
      µCSP Hardware Architecture
      µCSP Software Architecture
      µCSP Fault-Tolerant Architecture
      Smart Module Designs
      µCSP Achievement Highlights
   SuperCSP and STP-H6/SSIVP
   CSPv2

12 CONCLUSIONS

APPENDIX: SPACE PROCESSORS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

5-1   SEU upset rates for non-volatile memory reported by CREME96
5-2   Typical TID amounts for LEO with 1-year mission reported by SPENVIS
5-3   Yearly TID by orbit
5-4   Estimated board lifetime
5-5   CSPv1 board upset rate
5-6   Power system upsets/day
6-1   Xilinx Zynq-7020 ARM specifications
6-2   Xilinx Zynq-7020 FPGA specifications
6-3   CSPv1 Rev. B power consumption
7-1   Computational density and computational density per Watt of popular rad-hard processors and the Zynq
7-2   CoreMark benchmarking
9-1   CSP board upset rate
10-1  PRR resource utilization
10-2  Prototype total resource utilization
10-3  FPGA scrubbing duration
10-4  Computational density device metrics
10-5  Zynq processors' CoreMark benchmarking performance
11-1  Example components for Smart Modules
11-2  Major components of µCSP
11-3  SmartFusion2 ARM specifications
11-4  SmartFusion2 FPGA specifications
A-1   SmallSat processors and single-board computers

LIST OF FIGURES

1-1   SpaceWorks historical nano/microsatellite launches
2-1   Simplified fault-tree example in NASA's Fault-Tree Handbook
2-2   Simple DFT and its equivalent, complex, and large Markov model representation demonstrating state explosion by Boudali et al.
2-3   ARM processing-configuration illustrations
2-4   Lockstep operation
2-5   RFT architecture diagram
2-6   RadSat FPGA architecture layout with partial-reconfiguration regions
2-7   STP-H5/ISEM flight box 3D model
2-8   STP-H5/ISEM fully integrated payload
2-9   STP-H5/ISEM card block diagram
3-1   Performance scaled by power comparison of onboard processors
3-2   Costs of commercially available SBCs
5-1   Reliability methodology stages
5-2   Statistical structure of representative data
5-3   Example cross section vs. LET graph
5-4   Basic event for an SEU to a memory cell in non-volatile memory from heavy ions or trapped protons
5-5   System-level fault tree with key modules for analysis
5-6   Expanded memory module
5-7   Expanded non-volatile memory section
5-8   Non-volatile memory module with ECC
5-9   Graph generated by Windchill Predictions for case-study board failure
5-10  LEO and GEO reliability curves
5-11  Power module reliability
6-1   CSPv1 Rev. B block diagram
6-2   CSPv1 designs
6-3   CSPv1 Rev. B mated to evaluation boards
8-1   CSP at test facilities
9-1   STP-H5 pallet 3D view and integrated-for-flight system
9-2   STP-H5/CSP flight unit
9-3   CLIF OpenCL Framework
9-4   Example image products from STP-H5/CSP
10-1  World map displaying proton flux at the South Atlantic Anomaly
10-2  HARFT architecture diagram
10-3  ConfigMan and SPS-SL architecture diagram
10-4  Illustrated fault-tolerant modes diagram
10-5  FPGA configuration area in floorplan view
10-6  HARFT reliability with L2 cache disabled
10-7  HARFT reliability with L2 cache enabled
10-8  Upsets per day vs. performance with L2 cache disabled
10-9  Upsets per day vs. performance with L2 cache enabled
11-1  Example template for Smart Module
11-2  Integration and mating with a Smart Module
11-3  Ring network connection for Smart Module
11-4  µCSP computer board testing prototype
11-5  Example of 6U CubeSat wiring harness
11-6  SuperCSP backplane with 4 CSPv1s
11-7  Deconstructed view of STP-H6/SSIVP flight box
11-8  Fully assembled flight box for environmental testing


Abstract of Dissertation Presented to the Graduate School

of the University of Florida in Partial Fulfillment of the

Requirements for the Degree of Doctor of Philosophy

MODELING AND MITIGATION FOR HYBRID SPACE COMPUTERS

By

Christopher Mark Wilson

May 2018

Chair: Alan Dale George

Major: Electrical and Computer Engineering

Space is a hazardous environment for electronic systems; therefore, the traditional approach to space computing relies upon radiation-hardened electronics, which are characteristically more expensive, larger, less energy-efficient, and generations behind modern commercial processors in performance and functionality. Conversely, modern

commercial processors, while providing the utmost in performance and energy-efficiency, are

susceptible to space radiation. The desire for more autonomous missions, combined with growing demands for more detailed products from advanced sensors, has challenged organizations to

stay relevant by “doing more with less” to meet future requirements. To meet this need,

researchers at the National Science Foundation Center for High-Performance Reconfigurable

Computing have developed a new design concept in hybrid space computing that features fixed

and reconfigurable processor architectures merged with innovative system design that is a

combination of three technologies: commercial devices; radiation-hardened components; and

fault-tolerant computing.

To model the reliability of these new hybrid designs, a novel methodology was developed

for estimating reliability of space computers for small satellites from the system-level

perspective. This methodology is useful in scenarios where funding, time, or experience for

radiation testing is scarce. The output values of the method can then be used to build a first-order estimate of how well the system performs given specific mission-environment conditions.

Additionally, due to the complexity of including a hybrid-processor architecture within a system

design, a new fault-tolerant technique was developed to provide and evaluate tradeoffs in system

reliability, performance, and resource utilization. This fault-tolerant computing strategy accounts for the hybrid nature of a device and works cooperatively between both types of architectures.

The hybrid space-computing concept has culminated in the development of several novel

research platforms, most significantly the CSPv1 flight computer, which was successfully

deployed on the International Space Station. Prior to flight, this CSPv1 design was analyzed and

tested on the ground with the new reliability methodology and radiation tests. Recent in-flight data has validated the design and shown that CSPv1 exceeds the reliability expectations predicted

by models. The new fault-tolerant computing strategy was developed specifically for CSPv1 and will be deployed on it in future missions.


CHAPTER 1 INTRODUCTION

Future spacecraft technology will be defined by emphasizing the importance of creating

highly reliable and more affordable space systems. Prohibitive launch costs and increasing

demands for higher computational performance have promoted a rising trend towards

development of smaller spacecraft featuring commercial technology on higher-risk missions and

less-stringent standards, as exemplified by NASA Ames Research Center's PhoneSat mission [1]

and a survey of small spacecraft technology by Allman and Petro [2]. Enabled by these

advancements, it is now feasible for a group of small satellites (SmallSats) to perform the same

mission tasks that would have required a costly, massively sized satellite in the past. This

concept has been extensively studied for different missions [2-3]. The growing importance of

SmallSat missions was anticipated as early as 2000, as described in the National Research Council's (NRC) publication "The Role of Small Satellites in NASA and NOAA Earth Observation Programs" [4], and has burgeoned in recent years. The rationale this study presented for the advancement of SmallSats remains largely unchanged since its original publication. The study stressed the benefits of SmallSats as low-

cost yet capable platforms offering great architectural and programmatic flexibility. Additionally,

the study highlighted unique design features that apply to SmallSats, such as distributed

functions, observation strategies (constellations and clusters), rapid infusion of technology, and

both budget and schedule flexibility.

The gradual progress and direction of spacecraft technology towards SmallSats can be

attributed to NASA’s response to the NRC’s decadal survey for Earth science. The decadal

survey focuses on the needs and priorities of the scientific community to plan key space-research areas and missions. In the midterm assessment [5] of the original 2007 survey [6], there are two key findings that can be addressed with new processing capabilities for small satellites.

The first main finding: “The nation’s Earth observing system is beginning a rapid decline in

capability as long-running missions end and key new missions are delayed, lost, or canceled.”

This finding describes dwindling numbers of planned and funded, larger Earth

observation satellites. This shortcoming is problematic since Earth science needs more data to

sustain more powerful climate and weather models. A further concern is that next-generation

instruments generating more data will saturate satellite downlink bandwidth. Therefore, a

possible solution is to develop more small satellites that can perform onboard processing when

feasible to allow results to be transmitted in lieu of the entire data set.

Another major finding of the decadal survey: “Alternative platforms and flight

formations offer programmatic flexibility. In some cases, they may be employed to lower the

cost of meeting science objectives and/or maturing remote sensing and in situ observing

technologies.” The alternative platforms mentioned in the survey include small satellites that can

either act independently or work cooperatively to form a distributed science mission.

SmallSats, especially in the range of nano and micro satellites, have rapidly become more

advanced and have been featured in more missions in recent years. This growth has been attributed to CubeSat (a sub-class of SmallSat) research programs started by the National Science Foundation (NSF), which have spurred university participation and growing commercial interest in using SmallSats for Earth observation and remote sensing. CubeSats have

become so popular largely due to “comparatively low development costs, miniaturized

electronics, and timely availability of affordable launch opportunities” [7]. Correspondingly, the

number of CubeSat launches has rapidly expanded. SpaceWorks, a company that focuses on

monitoring global satellite activities, publishes studies on its findings annually [8]. In Figure 1-1,


SpaceWorks highlights a sudden increase in SmallSats in the 1 to 50 kg range from 2000 until

2016, emphasizing major changes in the space development ecosystem.

Figure 1-1. SpaceWorks Historical Nano/Microsatellite Launches [8].

In 2015, NASA released a technology roadmap to describe the future development efforts

required to create novel, cutting-edge technologies that enable new capabilities for ambitious

future space missions [9]. In this roadmap, there are 15 distinct technology areas (Launch

Propulsion Systems, Science, Instruments, Observatories, Sensor Systems, etc.) relating to

different aspects that comprise space missions. Additionally, they note technology topics that

encompass and overlap multiple areas. One of these domain-crossing technology topics is

avionics, which focuses on the electronic systems that are essential to satellite capabilities.

In 2016, the NRC published a study [10] that investigated all the topics found in the

roadmap to provide recommendations of focus for NASA, ranking the topics in order of

importance, and classifying key topics as "high-priority." In total, 88 topics were classified as high-priority technologies, and 26 of those 88 (roughly 30%) are encapsulated by avionics, further

highlighting the significance of computing and processing for space operations. SmallSats can


play a crucial role in advancing key technology roadmap topics through technology

demonstration of new computers and systems.

Even though the most popular SmallSat platform (CubeSats) is small, the demands for

advanced science and capabilities are always increasing. Both future missions and spacecraft

have a principal need for high performance and reliability. Therefore, the major challenge in developing future spacecraft is to balance the demands of onboard sensor and science processing

with the limitations of reduced power, size, weight, and cost of a SmallSat platform. Current

SmallSat computing technologies, especially devices found in CubeSats, are prohibitively

limited, often featuring microcontrollers that scarcely approach the processing or reliability requirements of extensive science objectives. Even SmallSats equipped with modern, high-performance processors that meet performance needs may face reliability concerns due to hazardous radiation in space environments. SmallSat missions do not amass the funding of

larger spacecraft missions; therefore, purchasing state-of-the-art, radiation-hardened (rad-hard) processors is often infeasible due to extremely high costs. Additionally, while rad-hard

processors may meet reliability needs, a state-of-the-art, rad-hard processor is relatively

antiquated in terms of energy efficiency and performance compared to most modern commercial

processors. Therefore, rad-hard processors are unable to achieve the computing capability

needed for high-priority tasks in the technology roadmap, especially for compute-intensive

autonomous operations and complex sensor processing. Illustrating the need for reliable

computers that meet mission needs, in his 2015 keynote address [11] to the AIAA/USU Small Satellite Conference, General John Hyten, former Commander of the Air Force Space

Command, noted:


“We need to build computers with resilient architectures that meet operational and

mission requirements and logistically support a continued supply chain.”

This dissertation presents a survey of the challenges and opportunities of onboard

computers for small satellites and focuses upon new concepts, methods, and technologies to provide next-generation missions with the performance and reliability required to meet their

objectives. In this dissertation, we describe a novel, hybrid-computing concept to develop next-

generation spacecraft computers. This concept can be used to address key findings of the decadal

survey, as well as reinforce concepts highlighted in the CubeSat survey. The culmination of this

research is a CubeSat form-factor, multifaceted-hybrid computer, the CHREC Space Processor

v1 (CSPv1) [12], which is designed to scale to meet mission needs of varying spacecraft, from

CubeSats up to larger satellites.

The organization of this dissertation is as follows. In Chapter 2, we give the relevant background of the enabling programs, concepts, techniques, and tools related to hybrid space computing. Chapter 3 describes the current state of small-spacecraft computing and provides the rationale for the concept presented in this dissertation. Chapter 4 describes the overall hybrid-computing concept, known as the CSP concept. Chapter 5 introduces the reliability methodology developed for hybrid design analysis. In Chapter 6, we present the hardware, software, and fault-tolerant design of the CSPv1. Chapter 7 provides a performance analysis of CSPv1. In

Chapter 8, we describe radiation-testing, radiation-modeling, and workmanship-reliability results

used to validate the flight system. Chapter 9 discusses the first CSPv1 mission and preliminary

results. Chapter 10 describes a novel fault-tolerant framework for hybrid designs. In Chapter 11,

we highlight the successors to CSP research. Finally, Chapter 12 provides concluding remarks.


CHAPTER 2 BACKGROUND RESEARCH

This chapter focuses on providing background and related works to understand the design

decisions, techniques, and methodologies presented in this dissertation. This chapter provides a

cursory overview of the challenges and concerns for electronics in a radiation environment and

introduces small-spacecraft technology. Also provided are recommendations, research, and

programs focusing on radiation mitigation. For radiation modeling, this chapter presents an

overview of probabilistic risk assessment and fault-tree analysis. This chapter further defines the scope and concepts of both reconfigurable and hybrid architectures, as well as fault-tolerant computing techniques applied to those designs and closely related works. Finally, this chapter

describes the programs that have supported the first CSP mission.

Space-Radiation Environment

Unlike terrestrial environments, space presents electronics with a host of challenges for

reliability due to the effects of radiation. The principal challenge for sustained, reliable

computing in space arises from the environmental hazards of radiation to electrical, electronic,

and electromechanical (EEE) parts. EEE parts in space can be exposed to a wide range of

radiation environments, each with considerably different types of particles and fluences, which

lead to varying responses from negligible degradation or benign interrupt to complete and

catastrophic failure. There is no generalized or common-case space environment; therefore,

radiation effects must be analyzed on a per-mission basis.

Particles encountered in space can originate from several sources including Earth’s

magnetic field, Galactic Cosmic Rays (GCRs), and solar-weather events. The particle population trapped by Earth's magnetic field primarily consists of low-energy charged particles (electrons and protons) and some heavy ions.

Galactic Cosmic Rays originate from outside the solar system and are primarily protons and

alpha particles; however, heavy ions are also present in comparatively low numbers. Finally, solar-

weather events consist of solar winds, solar flares, and coronal mass ejections (CME), which are

predominately protons and a small fraction of heavy ions.

When these particles interact with electronic components, the effects can be generally

classified into two categories: long-term cumulative effects; and short-term transient effects

(commonly described as Single-Event Effects). Cumulative effects include a buildup of total

ionizing dose (TID) levels, ionization of circuits, enhanced low-dose-rate sensitivity (ELDRS),

and displacement-damage dose (DDD). The single-event effects (SEE) category includes single-

event upsets (SEU), single-event transients (SET), single-event latchups (SEL), single-event

burnouts (SEB), single-event functional interrupts (SEFI), and lastly single-event gate ruptures

(SEGR). EEE components (even an identical device from a different lot) can react differently to

radiation, and experience different effects more prominently. Radiation-effects testing is a broad

field with extensive studies on the complex relationship of various devices (including processors)

to radiation. These radiation effects and the space environment are covered in detail by many

organizations [13-19]. Space-processor designers must consider these effects carefully when

designing a system to operate within a hazardous space environment.
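As a concrete illustration of how one of these hazards is quantified, a common first-order estimate multiplies a device's measured upset cross-section by the expected on-orbit particle flux. The following minimal Python sketch (not from this dissertation; all numbers are placeholders, not measured data) shows the arithmetic:

```python
# Illustrative sketch: first-order SEU-rate estimate for a memory device.
# rate = (cross-section per bit) x (integral particle flux) x (device bits)

CROSS_SECTION_CM2_PER_BIT = 1.0e-14  # assumed saturation cross-section
FLUX_PARTICLES_CM2_DAY = 1.0e4       # assumed integral flux above LET threshold
DEVICE_BITS = 128 * 1024 * 1024      # a 128 Mib memory

upsets_per_day = CROSS_SECTION_CM2_PER_BIT * FLUX_PARTICLES_CM2_DAY * DEVICE_BITS
print(f"First-order estimate: {upsets_per_day:.4f} upsets/day")
```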

Radiation Mitigation Programs and Processes

Due to the severity of space radiation to components NASA created several efforts to

perform research and make recommendations for space designs. These include Radiation

Hardness Assurance (RHA) process, Single-Event Effects Criticality Analysis (SEECA), the

NASA Electronic Parts and Packaging (NEPP) program, and finally the CubeSat Part Selection

Process.


Radiation Hardness Assurance (RHA)

Due to the complex response of emerging COTS technologies to radiation, NASA has

developed an approach to building reliable space systems that strives to address critical emerging issues, including displacement damage dose (DDD), enhanced low dose rate sensitivity

(ELDRS), proton damage enhancement (PDE), linear transients, and other catastrophic single-

event effects. This methodology is referred to as Radiation Hardness Assurance (RHA) for Space

Flight Systems [20]. NASA’s definition is presented:

“RHA consists of all activities undertaken to ensure that the electronics and materials of a

space system perform to their design specifications after exposure to the space environment.”

RHA encompasses mission systems, subsystems, environmental definitions, part

selection, testing, shielding, and fault-tolerant design. This dissertation builds upon key stages of the

programmatic methodology presented by RHA.

The main stages of the RHA process include:

1. Defining the hazard

2. Evaluating the hazard component

3. Defining requirements

4. Evaluating device usage

5. “Engineering” with designers

6. Iterating the process throughout the mission lifetime

One of the goals in the RHA process is to enable a small work group to address radiation

reliability issues related to COTS and emerging technology while supporting a large number of

projects. The RHA process is also significant because it addresses major issues with risk-

assessment approaches including pitfalls, limitations, and recommendations. This process also

addresses the realities of risk assessment and offers key guidelines for providing an analysis when radiation effects involve so many unknowns and demand so much specialized knowledge [20-23].


Single-Event Effects Criticality Analysis (SEECA)

SEECA is a NASA document that offers a methodology to identify the severity of an

SEE in a mission, system, or subsystem, and provides guidelines for assessing failure modes.

The document pulls together key descriptive elements of single-event effects in microelectronics

and the applicable concepts to help in risk analysis and planning. SEECA is one of the key

components of RHA described above. SEECA is a specialized Failure Modes and Effects

Criticality Analysis (FMECA) study. FMECA offers valuable analysis and insight through

inductive analysis, which can be used to enhance models and techniques used in Probabilistic

Risk Assessment (PRA) [22].

NASA Electronic Parts and Packaging (NEPP) Program

NASA has a group dedicated to studying any EEE parts for space use including COTS

components. NEPP and its sub-group, the NASA Electronic Parts Assurance Group (NEPAG),

provide agency-wide infrastructure for guidance on EEE parts for space usage. Their domains of

expertise encompass qualification guidance (both manufacturer and parts), technology

evaluations, standards, risk analysis, and information sharing. The entire program is covered in

[23]. Our presented methodology is complementary to NEPP methods. This dissertation describes a complete methodology that adds methods for system-level analysis, whereas NEPP analysis is primarily focused on individual-part qualification and does not account for board- or system-level fault-tolerant analysis.

Example NASA CubeSat Part Selection Process

This section describes an example part-selection process when designing and selecting

components for a CubeSat processor. Initial component selection is an important pre-stage to the

methodology presented in this dissertation, which assumes a bill-of-materials and component

list has been established. This section describes an agnostic approach to part selection with


respect to performance requirements found in programs at both NASA Ames and NASA

Goddard centers and relayed by NASA engineers through personal communication.

The following is a list of general recommendations to follow while keeping both schedule

and budget in close consideration:

• Maintain a mass and volume budget margin for spot/sector shielding directly proportional

to both the expected dose and electronic system mass.

• Select parts from a reference board design that has successfully flown in a previous

mission of equivalent mission duration.

• Select components in the following general flow (a scoring sketch based on this flow appears after this section): radiation hardened by design > radiation hardened > radiation tolerant > military > automotive > industrial > commercial.

• If commercial components are selected, choose the components that have radiation

hardened or tolerant equivalents. These components typically have lower burn-in failure

rates, and can be swapped for their radiation-hardened counterparts if necessary.

• Select commercial components that have the same dies as radiation-hardened or tolerant

products.

• Use components built on wider band gap substrates (including resistors) and/or with

wider band gap active regions.

• Use MRAM instead of Flash memory architectures.

• Use p-type MOSFETs instead of n-type.

• Use BJTs instead of MOSFETs if allowable.

• Select components with a higher gate voltage and lower operational voltage.

• Embed watchdog features, filters, and reset capability into each subsystem.

It should also be noted that components have other issues to consider not related to radiation. An

extensive requirements document is described by Sahu [24].
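As a rough illustration of the selection flow recommended above, the following minimal Python sketch (not part of any NASA process) ranks hypothetical candidate parts by component grade and by whether a rad-hard equivalent exists; the part names and fields are illustrative assumptions:

```python
# Illustrative sketch: ranking candidate parts by the general selection
# flow above. Component names and fields are hypothetical placeholders.

# Preference order from the guidelines, best first.
GRADE_ORDER = [
    "rad-hard-by-design", "rad-hard", "rad-tolerant",
    "military", "automotive", "industrial", "commercial",
]

def grade_rank(component: dict) -> int:
    """Lower rank is preferred; unknown grades sort last."""
    try:
        return GRADE_ORDER.index(component["grade"])
    except ValueError:
        return len(GRADE_ORDER)

candidates = [
    {"part": "NVM-A", "grade": "commercial", "rad_hard_equivalent": True},
    {"part": "NVM-B", "grade": "rad-tolerant", "rad_hard_equivalent": False},
]

# Prefer higher grades; among equals, prefer parts with a rad-hard
# equivalent that can be swapped in if necessary.
candidates.sort(key=lambda c: (grade_rank(c), not c["rad_hard_equivalent"]))
print("Preferred part:", candidates[0]["part"])
```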


Reliability Modeling

Even if a designer understands the effects of radiation on the relevant components, the

designer must be able to use the information to create models. This section describes the

modeling approach chosen for the radiation methodology described in this dissertation.

Probabilistic Risk Assessment and Fault-Tree Analysis

A key component of this paper is based around Probabilistic Risk Assessment (PRA).

PRA is a systematic methodology for evaluating risks associated with a complex engineering

technological entity. PRA is typically used to determine what can go wrong with the studied entity, what the initiating events are, how severe the consequences of an initiating event are, and how likely those consequences are to occur. Over the past

few decades, PRA and its included techniques have become both respected and widespread for

safety assessment [25].

Figure 2-1. Simplified fault-tree example in NASA’s Fault-Tree Handbook [26].

Fault-Tree Analysis (FTA) is a logic and probabilistic technique used in PRA for system-

reliability assessment. FTA is deductive in nature: it works by specifying an undesired or failure state and then analyzing the system to find all the possible ways the failure

state might occur. The usefulness of this approach is that the fault/error events can be represented


as hardware failures, human errors, software errors, or any related events. Graphically, a fault

tree has a single top event, which is a specific failure mode; below it are contributing events, connected by logic gates that show how lower-level events combine into higher-level events and eventually lead to the top failure event. A simple example fault tree is presented

in Figure 2-1, where the failure of D is the top event, and the failures of A, B, and C represent component failures. FTA became more prevalent in the space community after the

1986 space shuttle Challenger disaster, when the importance of reliability-analysis tools like

PRA and FTA were realized [26].
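To make the gate logic concrete, the following minimal Python sketch (not from the handbook) evaluates the top-event probability of a small static fault tree with independent basic events; the gate structure, loosely inspired by Figure 2-1 (D fails if A fails or both B and C fail), and all probabilities are assumptions for illustration:

```python
# Illustrative sketch: top-event probability of a small static fault tree
# with independent basic events. Tree: D fails = A OR (B AND C).

def or_gate(*p):
    """P(at least one input event occurs), assuming independence."""
    q = 1.0
    for pi in p:
        q *= (1.0 - pi)
    return 1.0 - q

def and_gate(*p):
    """P(all input events occur), assuming independence."""
    q = 1.0
    for pi in p:
        q *= pi
    return q

p_A, p_B, p_C = 1e-4, 1e-3, 1e-3          # assumed basic-event probabilities
p_top = or_gate(p_A, and_gate(p_B, p_C))  # D fails: A OR (B AND C)
print(f"P(top event) = {p_top:.3e}")
```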

Dynamic Computer Fault Tree and Markov Models

The standard fault-tree approach is not robust enough to properly reflect more complex

computer systems, where the failure mode is highly dependent on the order of failures in the

system (e.g., cold spare swaps). To enhance the FTA approach, the Dynamic Fault Tree (DFT)

methodology has been specifically developed for the analysis of these complex computer-based

systems. The DFT methodology provides a means to combine FTA with Markov modeling

analysis, which is commonly used in reliability modeling for fault-tolerant computer systems.

Markov models can easily reflect sequence-dependent behavior that is associated with fault-

tolerant systems. There are disadvantages to using Markov models alone: they can be tedious to create, error-prone, and subject to drastic size increases as more states are added, a problem known as state explosion. Figure 2-2 displays a DFT for a road-trip failure and its equivalent Markov

model that has become needlessly complex due to state explosion.


Figure 2-2. Simple DFT and its equivalent, complex, and large Markov model representation

demonstrating state explosion by Boudali et al. [27].

In the NASA fault-tree handbook [26], it is demonstrated that a large system-level fault

tree can be segmented into smaller, independent modules that are solved separately and then

recombined for a complete analysis. Certain trees can be solved faster as a DFT than as a

Markov model, but for some complex component interactions, the Markov model may be more

appropriate. In this case, a Markov model can be created and re-integrated into the fault tree.
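For illustration, the following minimal sketch (assuming SciPy availability and hypothetical failure rates) solves a three-state continuous-time Markov model for a primary processor backed by a cold spare, exactly the kind of sequence-dependent behavior that motivates DFTs:

```python
# Illustrative sketch: a three-state continuous-time Markov model for a
# primary processor with a cold spare. Failure rates are hypothetical.

import numpy as np
from scipy.linalg import expm

LAM_PRIMARY = 1e-3  # failures/hour while the primary is active (assumed)
LAM_SPARE   = 2e-3  # failures/hour once the spare is active (assumed)

# States: 0 = primary up, 1 = spare up, 2 = system failed.
Q = np.array([
    [-LAM_PRIMARY,  LAM_PRIMARY,        0.0],
    [         0.0,   -LAM_SPARE,  LAM_SPARE],
    [         0.0,          0.0,        0.0],  # failed state is absorbing
])

t_hours = 1000.0
p0 = np.array([1.0, 0.0, 0.0])   # start with the primary active
p_t = p0 @ expm(Q * t_hours)     # state distribution at time t
print(f"Reliability at {t_hours:.0f} h: {1.0 - p_t[2]:.4f}")
```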

DFT and FTA have other uses; the most significant of these is calculating various

importance measures. These can help identify the contribution a specific element makes to the

top-event probability, the amount of reduced risk if an event is assured not to occur, the

probability of a top gate failure if a lower gate was assured not to occur, and finally the rate of

change in the top event if there is a rate of change in a lower event. These significance measures

can greatly aid the part selection process and expose potential weaknesses in a design.
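As a small worked example of one such measure, the sketch below computes Birnbaum importance, the sensitivity of the top-event probability to each basic event, for the hypothetical tree used in the earlier sketch; all probabilities are illustrative assumptions:

```python
# Illustrative sketch: Birnbaum importance for the hypothetical tree
# D fails = A OR (B AND C), i.e.,
# I_i = P(top | event i occurs) - P(top | event i does not occur).

def top(pr):
    """Top-event probability for D fails = A OR (B AND C)."""
    return 1.0 - (1.0 - pr["A"]) * (1.0 - pr["B"] * pr["C"])

p = {"A": 1e-4, "B": 1e-3, "C": 1e-3}  # assumed basic-event probabilities

for name in p:
    hi = top({**p, name: 1.0})   # basic event forced to occur
    lo = top({**p, name: 0.0})   # basic event forced not to occur
    print(f"Birnbaum importance of {name}: {hi - lo:.3e}")
```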

There are limitations, however, to the fault-tree model: it is not exhaustive and can only cover the faults that the analyst has considered [27-31].


Types of Computing

The hybrid space-computing architecture described in this dissertation relies on several

different types of computing. In this section, an overview is provided for reconfigurable

computing, hybrid computing, and fault-tolerant computing.

Reconfigurable Computing

Reconfigurable computing is a subset of computer architecture that focuses upon devices

with adaptive designs that can be programmed to create different architectures and circuits. The

devices most commonly associated with reconfigurable computing are field-programmable gate

arrays (FPGAs). There are several advantages of using an FPGA over a general-purpose CPU or

microprocessor. Firstly, FPGAs enable a designer to create custom, application-specific

architectures to exploit algorithmic parallelism. Also, FPGAs are typically more energy-efficient

than a general-purpose processor, enabling a designer to achieve massive computational speedup

on an application while consuming less energy. In addition, due to the flexible, reconfigurable

design of the architecture, FPGAs are frequently employed to interface multiple high-bandwidth

sensors to a system (commonly referred to as “interface glue logic”), since designers can

configure the input/output pins as needed.

FPGAs are desirable for use in space because many space applications, such as synthetic

aperture radar (SAR), hyperspectral imaging (HSI), image processing, and image compression,

are highly amenable to parallelization within an FPGA. This approach enables missions to

perform critical data processing onboard, which can preserve transmission bandwidth, as

opposed to transmitting an entire dataset for processing on the ground. Additionally, some

FPGAs offer additional flexibility through run-time reconfiguration of sections of the architecture, a feature known as partial reconfiguration (PR).


Partial reconfiguration is the process of reconfiguring a specialized section of the FPGA

during operational runtime. In Xilinx devices, PR is possible through a modular design technique

known as partitioning. In the typical FPGA programming process, FPGA configuration memory

is programmed with a bitstream that specifies the design. In PR, partial bitstreams are loaded into

specific reconfigurable regions of the FPGA without compromising the integrity of the rest of the

system or interrupting holistic system operation. There are many benefits to using PR in space

applications and missions. A designer can use PR to reduce the total area utilization of the FPGA design by swapping modules in and out of the PR region instead of statically placing every module in the design simultaneously. This scheme reduces the required amount of configuration memory and

FPGA resources used, which in turn reduces the area vulnerable to SEEs. Correspondingly, a

decrease in area also decreases power consumption for the device, which is valuable in small-

satellite missions with particularly pressing power constraints. PR is a key component of several

FPGA fault-tolerant computing strategies that designers can use in space. Finally, due to the

smaller storage size of a partial bitstream (compared to a full bitstream), PR allows for faster and

easier transfer of new applications to a device, enabling the spacecraft to conduct new, secondary

mission experiments. Xilinx provides more details for partial reconfiguration on the Zynq [32-

33].
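For a sense of what PR looks like operationally, the following minimal sketch loads a partial bitstream on a Zynq-7000 running Linux. It assumes the legacy Xilinx xdevcfg driver and its sysfs attribute (device paths vary by kernel version, and newer kernels use the fpga-manager framework instead), and the bitstream filename is hypothetical:

```python
# Illustrative sketch: loading a partial bitstream through the legacy
# Xilinx xdevcfg interface. Paths are assumptions for a Zynq-7000 kernel
# of this era; newer kernels expose fpga-manager instead.

XDEVCFG = "/dev/xdevcfg"
IS_PARTIAL = "/sys/devices/soc0/amba/f8007000.devcfg/is_partial_bitstream"

def load_partial_bitstream(path: str) -> None:
    # Tell the configuration driver the next bitstream is partial,
    # so it does not reset the static region of the design.
    with open(IS_PARTIAL, "w") as flag:
        flag.write("1")
    # Write the partial bitstream into the PL configuration port.
    with open(path, "rb") as src, open(XDEVCFG, "wb") as dst:
        dst.write(src.read())

load_partial_bitstream("accel_region0_partial.bin")  # hypothetical module
```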

Unfortunately, while more powerful, commercial SRAM-based FPGAs are sensitive to

radiation in space. FPGAs are highly reconfigurable, and rely on their configuration memory to

store the configuration data that describes the custom-designed architecture. Radiation strikes are

a critical concern for SRAM-based FPGAs because they could cause an SEU, which is a change

in memory state, corrupting the configuration memory. The FPGA could malfunction or operate


against specifications due to configuration memory corruption. FPGAs and their interactions with

radiation effects are extensively described in multiple references [34-36].

Hybrid Computing

We define hybrid computing as a mix of dissimilar computing technologies to gain their

collective advantages. Examples of hybrid computing are: (1) a hybrid-processor combination of

dissimilar device architectures, such as a general-purpose CPU combined with an FPGA on the

same chip or on the same board; or (2) a hybrid-system combination of rad-hard devices with

higher-grade commercial devices to simultaneously achieve high reliability and performance.

Hybrid-processor architectures are gaining popularity in the commercial computing

industry. System-on-chip (SoC) devices are the most prevalent examples of hybrid-processor

architectures. These devices combine several predesigned “blocks” onto a single chip. These

blocks can be embedded processors, memory blocks, interface blocks, and a variety of other

components [37]. SoCs have become popular in mobile devices, embedded systems, and

consumer electronics due to their low power, high performance, and ease of system integration.

For this research, the SoC devices of interest are those that specifically adapt and integrate

multiple computing architectures, such as a combination of CPUs, GPUs, FPGAs, and DSPs.

Common examples of these architectures are Nvidia’s Tegra K1, X1, and X2 (CPU+GPU) [38],

Xilinx’s Zynq (CPU+FPGA) [39], and TI’s Keystone I and II (CPU+DSP) [40]. The main

attraction of these architecture combinations is to partition applications and algorithms onto the

portion of the device for which they are best suited to achieve performance gains. Jacobs et al.

deconstruct a common space application, hyperspectral image processing (HSI), into stages and

describe how the application could be accelerated with a hybrid architecture [41]. In that paper,

target detection and classification on a hyperspectral image can be divided into three stages

(metric calculation, weight computation, and target classification). The metric calculation and


target classification stages exhibit a large amount of fine-grained parallelism that can be best

exploited by an FPGA. The middle stage (weight computation), however, is sequential in nature

and best suited for a traditional CPU. A hybrid device like the Zynq can perform the entire application on a single chip.
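A schematic sketch of this partitioning is shown below; the fpga_run handle standing in for a hardware-accelerated kernel is hypothetical, and the stage names follow the decomposition in [41].

    def classify_hyperspectral(cube, fpga_run, cpu_weights):
        """Three-stage HSI target classification split across a CPU+FPGA SoC.

        fpga_run(stage, data) is a hypothetical handle to a kernel in the
        FPGA fabric; cpu_weights is the sequential weight-computation
        routine that remains on the ARM cores.
        """
        metrics = fpga_run("metric_calculation", cube)    # fine-grained parallelism
        weights = cpu_weights(metrics)                    # sequential, CPU-bound stage
        return fpga_run("target_classification", (cube, weights))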

Just as hybrid-processor designs seek to exploit the benefits of different computing

architectures for processing, hybrid-system design focuses on the advantage of balancing the

benefits of commercial and rad-hard devices for reliability and performance. Commercial

devices have the energy, cost, and performance features of the latest technology advancements;

however, these devices are commonly susceptible to radiation effects in space. Moreover, commercial components often lack flight heritage or radiation-response data. Radiation-

hardened and radiation-tolerant devices are relatively immune to radiation, but are more

expensive, physically larger, harder to procure, and are often technology generations behind in

both performance and functionality. Hybrid-system design seeks to use commercial devices,

augmented by fault-tolerant computing strategies, and combined with radiation-hardened

devices, to achieve the best characteristics of both devices.

Fault-Tolerant Strategies

Space systems incorporate a variety of fault-tolerant computing techniques for reliable

operation in space. Traditional fault tolerance in computing is reflected by redundancy in

hardware, information, network, software, or time. Appropriate mission fault tolerance is a

complex system-design challenge, because fault tolerance always introduces tradeoffs in

hardware, software, performance, and cost.

Hardware redundancy is provided by incorporating additional hardware into the design,

such as having three processors instead of one performing the same function (known as triple-

modular redundancy). Information redundancy is exemplified by error-detection and correction


coding (EDAC), error-correcting codes (ECC), cyclic redundancy check (CRC), algorithm-based

fault tolerance (ABFT), and parity checking. Network redundancy relies upon redundant network

links and paths within the topology. Software redundancy is a broad category of fault tolerance,

with checkpoint and recovery as well as exception handling being prominent examples. Finally,

time redundancy is accomplished through repeated execution of the same program on hardware,

which is primarily used to counter transient faults. The field of fault-tolerant, or dependable, computing is extensive; more information can be found in Koren and Krishna [42].
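As a software analogy of two of these categories, the sketch below combines majority voting (the decision rule behind TMR) with time redundancy (repeated execution with compare); it illustrates the concepts only and is not flight code.

    from collections import Counter

    def tmr_vote(a, b, c):
        """Majority vote over three redundant results (hardware redundancy)."""
        value, votes = Counter([a, b, c]).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("No majority: all three copies disagree")
        return value

    def time_redundant(fn, *args, retries=3):
        """Re-execute until two consecutive runs agree (time redundancy),
        which masks transient faults but not persistent ones."""
        prev = fn(*args)
        for _ in range(retries):
            curr = fn(*args)
            if curr == prev:
                return curr
            prev = curr
        raise RuntimeError("Results never stabilized")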

Commonly employed fault-tolerant techniques for ASICs and general-purpose processors aim to protect memory and logic elements. These elements include general-

purpose registers, the program counter, Translation Lookaside Buffer (TLB) entries, memory

buffers, or the branch predictor, and they can be upset by radiation, causing a variety of adverse

effects [43]. SEEs in a processor can manifest as a program crash, a hanging process, a data

error, an unexpected reset, or performance degradation [44].

Due to their unique architecture, FPGA devices retain their own fault-tolerant computing

strategies. The main source of radiation concerns for SRAM-based FPGAs is corruption in the

configuration memory that defines device routing and logic, and in application block RAMs. Configuration memory

allows the FPGA to maintain its pre-programmed, architecture-specific design; therefore, an

upset to configuration memory can dramatically change the desired function of the device. These

memory structures along with flip-flops are particularly vulnerable to radiation. To counter

errors caused by radiation effects, designers employ configuration-memory scrubbing. Scrubbing is

the process of quickly repairing configuration-bit upsets in the FPGA before they render the

device inoperable [45]. Additionally, designers use ECC and parity schemes for block RAMs

and some FPGA configuration memory. Finally, a common approach is to triplicate design


structures in the FPGA using triple-modular redundancy (TMR). Several references [34-36]

provide examples of these strategies.
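To illustrate the readback-scrubbing idea in software terms, the sketch below compares configuration frames against golden CRCs and rewrites any frame that miscompares. Real scrubbers operate at the frame level through device-specific interfaces, so read_frame and write_frame here are hypothetical stand-ins.

    import zlib

    def readback_scrub(read_frame, write_frame, golden_frames):
        """One scrub pass: read each configuration frame back, compare its
        CRC against the golden value, and repair on mismatch.

        read_frame(i) -> bytes and write_frame(i, data) are hypothetical
        device-access callbacks; golden_frames maps index -> (crc, data).
        Returns the number of repaired frames.
        """
        repaired = 0
        for index, (golden_crc, golden_data) in golden_frames.items():
            if zlib.crc32(read_frame(index)) != golden_crc:
                write_frame(index, golden_data)  # rewrite the corrupted frame
                repaired += 1
        return repaired

Blind scrubbing, by contrast, would skip the comparison and simply rewrite every frame on a fixed period.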

In preparing for missions, designers should analyze their use of fault tolerance in

consideration of mission requirements, since space environmental conditions vary with mission

orbit. For example, certain missions may have a short duration and, therefore, parts can be

selected that have much shorter lifetimes due to radiation, which would not be considered in a

longer, multi-year mission. Space systems must also prioritize fault-avoidance measures, such as parts screening, to avoid selecting components known to fail catastrophically due to radiation effects.

The following subsections focus on the key techniques that comprise the new hybrid

fault-tolerant strategy specifically targeting the Xilinx Zynq SoC. These strategies include

switching between symmetric and asymmetric processing modes, lockstep operation for a

processor, Reconfigurable Fault Tolerance, and finally partial reconfiguration with spare-processor swapping, as demonstrated by the RadSat mission.

Symmetric and Asymmetric Multiprocessing (SMP / AMP)

The Zynq is a highly capable device due to the hybrid nature of its SoC design including

both ARM cores and FPGA fabric. So far, this paper has only considered techniques applicable

to the FPGA fabric; therefore, this section describes unique capabilities available to the ARM

processing system. The ARM cores on the Zynq are capable of running a variety of Linux (and

other) operating-system kernels. The default configuration for running Linux on a development

board is symmetric multiprocessing (SMP) mode. SMP is a processing model that consists of a

single operating system controlling two or more identical processor cores symmetrically

connected to main memory and sharing system resources. This type of configuration is beneficial

for running applications configured for multithreaded processing. SMP makes it possible to run

several software tasks concurrently by distributing the computational load over the cores in the


system. Asymmetric multiprocessing (AMP) differs from SMP in that the system can include

multiple processors running a different operating system on each core. Typical examples include

a more full-featured operating system running on one processor, complemented by a smaller,

lightweight, efficient kernel running on the other processor [46-47]. Figure 2-3 demonstrates the

difference between the configurations. There are many potential benefits for this type of

operation [48], including:

• Allows a designer to segregate flight system operations and science applications for

system integrity

• Provides the ability to create a lightweight virtual machine on the system

• One core can be isolated as a secure-software zone for security applications

• The secondary core can also provide a real-time component to the system by running

FreeRTOS or other lightweight, real-time operating systems

• AMP allows for additional fault-tolerant techniques by setting up the system for duplex

with compare

• The secondary core also provides easier certification for applications due to smaller

codebase size for review


Figure 2-3. ARM processing-configuration illustrations. A) SMP configuration. B) AMP

configuration.
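On Linux-capable SoCs, one common way to realize an AMP split is the kernel's remoteproc framework. Below is a minimal sketch, assuming a kernel with remoteproc support for the second core and a firmware image staged in /lib/firmware; the image name is illustrative.

    from pathlib import Path

    RPROC = Path("/sys/class/remoteproc/remoteproc0")

    def boot_secondary_core(firmware: str) -> str:
        """Start a lightweight RTOS image on the second ARM core (AMP).

        Assumes the kernel exposes the remoteproc sysfs interface and that
        firmware (e.g., the hypothetical "freertos_duplex.elf") resides in
        /lib/firmware.
        """
        (RPROC / "firmware").write_text(firmware)
        (RPROC / "state").write_text("start")
        return (RPROC / "state").read_text().strip()  # expect "running"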


Lockstep Operation

In addition to the division of cores with AMP, lockstep operation is another type of fault

tolerance that designers can apply to CPUs. Lockstep operation is, put simply, an extension of a single core with hardware checking [49]. Lockstep systems run the same operations in parallel on redundant cores. Figure 2-4 is a graphical depiction of the lockstep process. Lockstep systems detect, and with sufficient redundancy correct, operational errors by comparing the outputs of the cores; the level of protection depends on how many cores run in lockstep [50].

Figure 2-4. Lockstep Operation.
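In software terms, duplex lockstep reduces to feeding identical inputs to two redundant executions and comparing outputs at every step; the sketch below illustrates this at loop granularity, whereas hardware lockstep compares at the bus or instruction level.

    def lockstep(fn_a, fn_b, inputs):
        """Duplex lockstep analogy: a mismatch signals a fault (detection
        only; correction would require a third copy to break the tie)."""
        for step, x in enumerate(inputs):
            out_a, out_b = fn_a(x), fn_b(x)
            if out_a != out_b:
                raise RuntimeError(f"Lockstep mismatch at step {step}")
            yield out_a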

Reconfigurable Fault Tolerance (RFT)

Another technique that builds on PR-based hardware is RFT. This framework [36],

described by Jacobs et al., seeks to enable a system to autonomously adapt and change fault-

tolerant computing modes based on current environmental conditions. In this system, the

architecture uses partially reconfigurable regions (PRRs) in parallel to create different redundancy-based, fault-tolerant modes,

such as duplex with compare (DWC) and TMR. Other mitigation techniques include algorithm-

based fault tolerance (ABFT) and watchdog timers. In their framework, the internal processor

evaluates the current performance requirements and monitors radiation levels (with an external

sensor, or by monitoring configuration upsets) to determine when the operating mode should be

switched. The overall contribution of their strategy is that it allows a system to maintain high


performance by swapping various hardware accelerators into the PRRs; however, when

environmental conditions deteriorate, the system can program critical applications into the

regions with varying levels of redundancy and fault tolerance. Figure 2-5 illustrates the RFT

architecture.

Figure 2-5. RFT Architecture Diagram [36].
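The mode-selection policy at the heart of RFT can be summarized with a sketch such as the one below; the thresholds are illustrative, whereas RFT derives its decisions from an external sensor or from observed configuration upsets [36].

    def select_ft_mode(upsets_per_hour, perf_demand):
        """Pick an operating mode for the PRRs from environment and workload."""
        if upsets_per_hour > 10:
            return "TMR"      # triplicate the critical design across three PRRs
        if upsets_per_hour > 1:
            return "DWC"      # duplex with compare: detection with less area
        if perf_demand == "high":
            return "ACCEL"    # fill the PRRs with independent accelerators
        return "SIMPLEX"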

Radiation Tolerant SmallSat (RadSat) Computer System

RadSat [51], a commercial-off-the-shelf (COTS) CubeSat developed by Montana State

University and NASA Goddard Space Flight Center (GSFC), is one example that demonstrates

PR-based fault tolerance. RadSat focuses on unique fault-tolerant computing methods for the

Virtex-6 FPGA. Here, the Virtex-6 is not an SoC, and so all necessary software is executed on

softcore processors (CPUs created with FPGA resources), such as the Xilinx MicroBlaze.


Figure 2-6. RadSat FPGA Architecture Layout with Partial Reconfiguration Regions [51].

In their proposed system, the FPGA fabric has multiple partially reconfigurable regions

(PRRs), where three of the regions run MicroBlazes in TMR, while the remainder of the PRRs

are spare regions. With this technique, when the TMR system detects a fault, the damaged region

is replaced with a spare region and is reprogrammed in the background using PR. To mitigate

other faults, the scrubber performs blind scrubbing (simple periodic configuration writeback

without checking for errors) on the PRRs, while deploying readback scrubbing (scrubbing while

reading back the contents of a frame to check for errors) through the rest of the static region of

the fabric. Figure 2-6 depicts the RadSat architecture layout and placement blocks for the PRRs.

Space Test Program Houston-5 (STP-H5)

The work presented in this dissertation was thoroughly evaluated with the successful launch of the first CSPv1 mission as a sub-experiment. This section describes the test program that allowed the experiment to gain flight heritage, as well as the main experiment with which CSP is integrated.


Space Test Program

The Space Test Program serves the Department of Defense (DoD) and its space science

and technology community as the main provider of spaceflight. Officially, it is chartered by the

Office of the Secretary of Defense to serve as: “...the primary provider of mission design,

spacecraft acquisition, integration, launch, and on-orbit operations for DOD's most innovative

space experiments, technologies, and demonstrations.” Formed in 1965, the Space Test Program

has been providing access to space for the DoD development community, and is responsible for

many of the military-satellite programs flying today [52, 53].

The Space Test Program Houston office is the sole interface to NASA for all DoD

payloads on the International Space Station (ISS), and other human-rated launch vehicles, both

domestic and international. The office’s main goals are to provide timely spaceflight, to assure

that the payload is ready for flight, and to provide management and technical support for the

safety and integration processes [54]. The CSP flight experiment is included on the fifth iteration

of these missions known as Space Test Program – Houston 5 (STP-H5). STP-H5 was integrated

and flown under the management and direction of the Department of Defense Space Test

Program Human Spaceflight Payloads Office.

ISS SpaceCube Experiment Mini (ISEM)

The CSP flight experiment (STP-H5/CSP) is included as a secondary module in the ISS SpaceCube Experiment Mini (STP-H5/ISEM), developed by NASA Goddard's Science Data

Processing Branch. One of the most recognizable contributions the branch has made to space

development is the successful design of SpaceCube, a family of high-performance reconfigurable

systems, which has inspired several design aspects of the CSPv1. SpaceCube has been featured

as the prominent technology on several missions including the Hubble Servicing Mission 4,

MISSE-7, and STP-H4 [55]. The ISEM experiment on STP-H5 focuses on SpaceCube Mini [56],


which serves as the primary communication bus for some of the DoD payloads, as well as STP-

H5/CSP. The ISEM 3D model and assembly are depicted in Figure 2-7 and Figure 2-8, respectively, which also show the Electro-Hydro Dynamic (EHD) thermal fluid pump experiment and the Fabry-Perot Spectrometer (FPS) for atmospheric methane. The connection diagram for

ISEM is illustrated in Figure 2-9.

Figure 2-7. STP-H5/ISEM flight box 3D model.


Figure 2-8. STP-H5/ISEM fully integrated payload.

Figure 2-9. STP-H5/ISEM card block diagram.


CHAPTER 3 SMALL SPACECRAFT COMPUTING

One of the primary motivators for the development of the hybrid space-computing

concept developed in this dissertation is the current focus of the community on small satellites

and small spacecraft. Small Satellites are diverse platforms that can contain a wide variety of

sensors, electronics, and deployables; however, one common denominator is that they all must include a computing or avionics system. SmallSat computing varies widely and can

range from small microcontrollers to powerful microprocessors. Since SmallSat missions accept

higher risk than traditional government-funded missions, space developers have been encouraged

to create computing technology that is more affordable, reliable, and high-performance. This

exploration into designs that are not fully rad-hard has enabled research into new concepts, such as the hybrid architecture featured in the CSPv1. This chapter is dedicated to further describing the historical trend toward SmallSats, defining the current state-of-the-

art, comparing SmallSat computing against traditional satellite computing, and finally

highlighting the challenges SmallSat computing faces.

SmallSats and CubeSats Overview

The rise of SmallSats can be traced to the interactions between several prominent space

organizations. In 2007 the NRC, at the request of several organizations including the National

Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric

Administration (NOAA), the National Environmental Satellite Data and Information Service

(NESDIS), and U.S. Geological Survey (USGS) Geography Division, conducted and published a

study (“2007 decadal survey”) on Earth observations from space to identify short-term needs and

longer-term scientific goals of importance [6]. In 2012, the NRC published a follow-up study

(“midterm assessment”) describing how key organizations were meeting the recommendations of


the original survey [5]. From an Earth-observation perspective, there were two key findings

driving SmallSat development. The first finding was that the nation's Earth-observing

capabilities have begun a rapid decline as several long-running missions were ending and

essential new missions were delayed, lost, or canceled. The NRC also found that NOAA’s ability

to meet science needs had greatly diminished due to budget shortfalls, cost overruns, and delays.

Secondly, the report identified the need for alternative platforms and flight formations to offer

programmatic flexibility and lower the costs of meeting mission requirements and objectives.

The U.S. Government Accountability Office (GAO), an office that identifies government

agencies and programs that are high risk, further emphasized the critical need for new, lower-

cost platforms. Out of 34 total high-risk areas in 2017, the only “science and technology” topic was “Mitigating Gaps in Weather Satellite Data,” describing the scenario feared in the midterm assessment [57].

Due to these highlighted challenges, SmallSats have flourished as a technology platform.

Within these constraining fiscal environments, relevant agencies, organizations, and missions are

forced to achieve compelling science at lower cost and faster schedule. The underlying

motivation driving SmallSats as a technology is encapsulated with the concept “do more with

less.” NASA and relevant organizations see value in SmallSats for a variety of reasons.

SmallSats benefit from comparatively lower development costs, miniaturized electronics, and

more easily accessible and affordable launch opportunities. SmallSats can also perform several

key functions. First, SmallSats can be used as technology demonstrations, providing

opportunities for new technology to be tested at no risk to larger programs, helping to reduce the time required to advance the state-of-the-art. SmallSats also provide unique

science opportunities that cannot be achieved by a single spacecraft, such as multi-point


measurements in a constellation or swarm of SmallSats. Constellations of lower-cost spacecraft

increase the reliability and capability of a mission, since failed spacecraft can be quickly replaced.

Finally, it has been suggested in [7] that CubeSats and SmallSats have the potential to mitigate

data gaps, such as the gap described by GAO, allowing for sustained measurements in the short

term, due to their shorter development cycles.

Michael Johnson, the NASA Chief Technologist of the Applied Engineering and

Technology Directorate, described NASA interest in SmallSats [58] as follows:

The capabilities of miniaturized systems are rapidly increasing while the resources

(mass, volume, power) they require are decreasing. At the same time, NASA’s

fiscal environment motivates competitive projects and missions to achieve

compelling science at lower cost and schedule than usual. We see small spaceflight

instruments hosted by small spacecraft as a potential response to this challenge.

SmallSat Technology State of the Art

To understand the benefits of the new hybrid-computing design, it is imperative to study

the currently available technology, which is summarized in a report commissioned by NASA’s

Small Spacecraft Technology Program (SSTP). This report [59], originally published in 2013, was created in response to the growing impact of and interest in small spacecraft, and served to assess key technology domains of spacecraft with mass below 180 kg. The report, however,

acknowledges a bias toward emphasizing CubeSat-related technology over SmallSats in general, due to the high market interest in CubeSats. The report describes two primary trends

driving the requirements for command and data handling on small spacecraft. The first trend is

the desire to introduce more complex science and technology applications, which requires high

system reliability and performance. The second trend is a desire to take advantage of the low-

cost, easy-to-build, accessible CubeSat development, primarily targeting hobbyists and

university programs without extensive experience in spacecraft development.


In the onboard-computing section of the report, NASA observes the proliferation of

microcontroller options due to the broadening number of CubeSat developers. The report

compiles a list of vendor-supplied, onboard-computing solutions which, in addition to

microcontrollers, contains SoCs, DSPs, and FPGAs. Table A-1 extends the list in the SSTP

report [59] for vendors of CubeSat and other SmallSat single-board computers (SBC), along with

missions upon which these devices were launched as reference. This table should not be

considered an authoritative, comprehensive database of every vendor; however, it serves to

provide a representation of the community. This list was extended through data supplied directly

from vendors, datasheets, literary references, and personal communication. It should be noted

that, since the list relies largely on publications, it will not account for changes in designs

between publication and launch. In addition, several popular vendors would not disclose specific

devices, due to the competition-sensitive nature of sales, and therefore are not reflected here

(e.g., Blue Canyon). There were many vendors contacted, and several did not respond, so some

frequently referenced designs or information is missing (e.g., Hyperion Technologies,

Endurosat). Finally, an empty entry in the mission column does not indicate a lack of flight

heritage, since some vendors could not release mission details, and many mission publications do

not cite adequate detail for specific devices or SBCs to be included. Some missions cited are not

SmallSats; however, this case does not preclude SmallSat missions from using a specific device.


SmallSat Computing vs. Traditional Spacecraft Computing

Flagship satellite missions primarily rely upon rad-hard devices to safeguard electronics

from failing, since these missions are vital and expensive. Common rad-hard processors on

recent missions include the Synova, Inc. Mongoose-V (New Horizons), BAE RAD6000

(DSCOVR), BAE RAD750 (GPM, JWST, Curiosity Rover), and Cobham Gaisler LEON-3FT

(Hayabusa2), which have extensive flight heritage. The RAD750 is emphasized as a state-of-the-

art flight device, comes in standardized CompactPCI (cPCI) 3U or 6U sizes [60], and consumes

a total power of 5W [61]. Notably, these devices are based on much older designs than current

commercial devices due to the considerable financial and schedule investment required to

develop new rad-hard products. Using the device-metrics approach described by Lovelly et al.

[61], Figure 3-1 shows the performance normalized by power consumption of selected devices of

interest: microcontrollers (blue); rad-hard (red); microprocessors (black); FPGAs (green); and

SoCs (purple). This figure illustrates several key outcomes. First, as expected, standard

microcontrollers have negligible performance compared to other device categories. The chart

also highlights the poor performance of rad-hard processors compared to commercial devices.

Finally, the figure displays the vast performance advantages to be gained with SoCs. Due to the

difficulty of obtaining device information, several assumptions regarding device operations had

to be made for Figure 3-1. An example of an assumption made is the number of operations per

cycle for 8 and 16-bit integers if not explicitly stated. Additionally, the BCM2835 was scaled by

board power instead of the expected device power (information not available). Additional

performance analysis on the capability of other rad-hard devices is presented in Lovelly et al.

[61]. Another study conducted in 2012 by Ramon Chips [62] compares both rad-hard devices and

commercial devices augmented with fault-tolerant strategies.


Figure 3-1. Performance scaled by power comparison of onboard processors.

Due to the cost of rad-hard devices, mission budget is the key consideration separating SmallSat computing from traditional spacecraft computing. Figure 3-2 displays the cost

of several commercial SBCs, where prices were easily identifiable. It should be noted that, for a

large number of vendors, SBC prices require a quote or non-disclosure agreement and therefore

are not included in this figure. This chart emphasizes the difference between these commercially

available devices and rad-hard boards that can cost orders of magnitude more than some

commercial options.

SmallSat missions may lack the budget to include all rad-hard electronics; however, they

are excellent platforms for new-technology demonstration. The primary computing benefit stems from the reuse of devices proven on SmallSats in larger-mission

satellites. The SmallSat state-of-the-art report [59] cites CompactPCI as a common SmallSat bus,

and shows that common SmallSat power solutions can also support the power profiles of some


rad-hard devices. A use case is demonstrated in Rodgers et al. [63], with engineers from

Information Sciences Institute seeking to fly an experimental, multi-core, rad-hard processor to

be validated on the small NovaWorks platform.

Perhaps the most overt demonstration of using any commercial electronics was observed

in a precedent set by NASA Ames Research Center and the PhoneSat program. In these

experiments, NASA demonstrated that they could fly common cellphones (Nexus One, Nexus S

smartphones) and basic electronics in space for a short period [64].

Figure 3-2. Costs of commercially available SBCs.

Challenges to SmallSat Computing

The challenges of SmallSat (including CubeSat) computing are largely related to the

challenges faced by SmallSats as a development platform. SmallSats, compared to large

satellites, have reduced size, weight, power, cost, and volume. These requirements also restrict

the capabilities of a single-board computer. General CubeSat trends and failures are considered

by M. Swartout’s presentations analyzing the St. Louis University CubeSat database [65].

[Figure 3-2 chart data: cost axis from $0 to $6,000; boards shown are the Raspberry Pi, BeagleBone Black, GumStix Overo EarthSTORM, NanoSatisfi Inc. ArduSat Kit, Pumpkin PPMs, ISIS OBC, and CubeSpace CubeComputer.]


General SmallSat challenges have also been addressed by NASA in the Small Spacecraft

Reliability Initiative [66].

The primary challenge facing SmallSat computing resides in the use of commercial

processors. Since commercial processors are not hardened for radiation, they are affected by

radiation effects as previously described. In addition, modern SoCs and FPGAs are complex

devices that contain additional IP blocks such as on-chip memory, clock management, and

interface controllers. These different components all require separate radiation tests to determine

each individual component’s error modes and upset rates. This realization further complicates

the design process, as radiation testing is time-consuming both to plan effectively and to conduct. Lastly, radiation testing is very expensive, which prohibits many organizations

from testing the devices they fly.

Commercial devices are constantly pushing the bounds of new technology and interact

with radiation in different ways, occasionally exhibiting new effects never before observed on

other devices. In Lee et al. [67], the authors describe an unconventional single-event latch-up,

also called “micro latchup,” that was discovered in new FPGA technology used for flight

missions.

University programs and hobbyists rely on commercial off-the-shelf CubeSat kits, both

for convenience and simplicity of development. Steven Guertin at NASA Jet Propulsion Lab

(JPL) has been conducting studies on common CubeSat microcontrollers and microprocessors

found in CubeSat kits, starting with his initial report in 2014 [68], with follow-up reports each

year at NEPP. His testing reveals that, while relatively resilient to TID for Low Earth Orbit

(LEO), most of these CubeSat kit devices show significant problems caused by latch-up. These

results do not guarantee that a device will fail; however, they highlight that during a low-


probability event in LEO, the device may suffer from significant issues. The Air Force Research

Laboratory has also conducted independent testing on common commercial kits in Avery et al.

[69].

The last challenge, highlighted by NASA Goddard in Clagett et al. [70], is that certain subsystems, such as flight software, communications, ground systems, and attitude control, fundamentally require the same functions as on larger spacecraft, with comparable analysis

and testing. Their mission made compromises to sensor data acquisition to perform all the

desired flight-software processing onboard with the selected microcontroller.

Better Computing with Hybrid Approach

Next-generation spacecraft missions seek to accomplish even more significant science

and defense objectives with SmallSats. New missions are proposed for more-challenging

radiation environments than LEO, including Lunar, Mars, and deep space. To accomplish these

objectives, computing will have to achieve a sufficiently high level of both performance and

radiation reliability.

Rudolph et al. [12] and the subsequent chapter propose a multifaceted, hybrid-design

methodology to achieve the benefits of both commercial and rad-hard designs. This approach

proposes a hybrid-system architecture, where commercial technology is featured for high

performance and energy efficiency, while the device is supported and managed by rad-hard

components for increased reliability. Additionally, the reliability is bolstered by fault-tolerant

computing strategies applied atop the commercial device. This hybrid approach also describes

use of a hybrid device (e.g., CPU+FPGA SoC) as the featured commercial processor to

maximize performance by optimizing algorithms based upon architecture needs.


CHAPTER 4 CONCEPTS OF HYBRID, RECONFIGURABLE SPACE COMPUTING

The primary contribution of this dissertation is to introduce the previously mentioned

hybrid-computing concept for space computers. This novel computing-design philosophy, known as “CSP the concept,” describes a multifaceted, hybrid-processing space system.

This concept centers on having both a hybrid-processor and hybrid-system architecture. A

hybrid-processor device with mixed technology, also known as System-on-Chip (SoC) device,

can achieve immense computational benefits depending on an algorithm’s structure. For

example, with a mixed FPGA+CPU combination, a parallel algorithm can be hardware-

accelerated on the FPGA fabric, while control-flow operations can be performed on the CPU

cores.

The CSP concept also features a hybrid-system architecture, which is a combination of

three themes: commercial-off-the-shelf (COTS) devices; rad-hard devices; and fault-tolerant

computing strategies. Commercial devices have the energy and performance benefits of the latest

technology advancements, but are susceptible to radiation in space, whereas rad-hard devices are

relatively immune to radiation, but are more expensive, larger, hard to procure, and outdated in

both performance and functionality. The keystone principle of the CSP concept is to include a

device with commercial technology featured for high performance and energy

efficiency, but supported by rad-hard devices monitoring and managing the commercial devices,

and further augmented by strategies in fault-tolerant computer architecture (FTCA). This concept

is illustrated in Figure 4-1.


Figure 4-1. Illustrated CSP concept diagram.


CHAPTER 5 RELIABILITY METHODOLOGY FOR SMALLSAT COMPUTERS

In the literature, there is no straightforward method for predicting the failure and upset rates of a given single-board computer in a given Earth-centric orbit. In order to

determine the reliability of a design and compare the given design to other configurations and

other boards, we developed a methodology [71] for estimating the reliability of SmallSat

computers in radiation environments. This new methodology is built upon established reliability techniques and includes PRA concepts to reflect the overall reliability (and other measures) of the

space-computer system due to radiation effects as quantifiable values. Figure 5-1 depicts an

overview of the methodology, which consists of four key stages.

Figure 5-1. Reliability methodology stages.


Methodology Stages

To provide step-by-step examples of the methodology in use, a configurable space-

computer board was selected and analysis was performed as a case study. This board is a multi-

faceted, hybrid computer called the CSPv1. The following sections describe the methodology,

using several case-study examples to illustrate the process.

Stage 1: Component Analysis

The first stage of the methodology is to compile a list of all EEE components that

constitute the current or proposed board design. This stage is relatively simple, but sets the

foundation for the rest of the analysis, because the engineer should become familiar with

different characteristics of the components. Once the list of EEE components is collected, each

component should then be classified into device family (Processor, Memory, Analog, Digital,

Power, Mixed Signal, etc.), feature size, process type, and function. It is important to have this

information in advance of the analysis, since each of these characteristics helps define a

component’s response to radiation. Several resources and tools are available to help examine

radiation effects by component. One prominent tool is the NASA CubeSat Radiation tool from NASA Goddard, which compiles device families and the SEE types to which each is susceptible. Finally, the reliability engineer should consider the depth of the analysis to be performed

for the mission, and select components for the final analysis. For example, in some missions it

may not be necessary to include analysis for passive components (resistors, capacitors, etc.) or

some simple analog components, and analysis is only performed on active components.
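The output of this stage is essentially a classified parts list. A minimal sketch of capturing it is shown below; the part numbers and classifications are illustrative only.

    from dataclasses import dataclass

    @dataclass
    class Component:
        """One EEE component in the board design (Stage 1 classification)."""
        part_number: str
        family: str         # Processor, Memory, Analog, Digital, Power, ...
        feature_size_nm: int
        process: str        # e.g., CMOS
        function: str
        active: bool        # passive parts may be excluded from the analysis

    bom = [
        Component("XC7Z020", "Processor", 28, "CMOS", "Zynq SoC", True),
        Component("MT29F-class", "Memory", 25, "CMOS", "NAND flash", True),
    ]
    analysis_set = [c for c in bom if c.active]  # scope chosen in Stage 1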

Stage 2: Radiation Data Collection

Once the list of key components is formulated, a broad search must be conducted to

collect all available radiation data for each component, focusing on data relating to effects

specified by the device family. This radiation data can be acquired from many sources including


manufacturer datasheets, independent testing publications, IEEE Nuclear and Space Radiation

Effects Conference (NSREC) proceedings, or most commonly the NASA Goddard Radiation

Database (https://radhome.gsfc.nasa.gov/radhome/raddatabase/raddatabase.html).

The key focus in this stage is to examine each desired component in the design and

determine if it should be used in the final mission or design. In this stage, we can employ RHA

and SEECA to examine the component’s risk. Due to the expansive number of existing EEE

components compared with the number of EEE components that have been flown or have

radiation data, it is unlikely that the exact desired component exists in any publicly available

database. Without access to internal databases from large organizations, it is difficult to acquire

actual mission data, so the next best data is from archived radiation testing.

If a part has radiation test data that is valid for the given mission parameters, then there is

no more work to be done in this stage. If a part has no radiation data, or responds poorly to

radiation effects, then the system designers will have to decide if the part should still be used. If

it is decided that the part will be used in the design, but has no radiation test data (accepting

risk), then for the purposes of the system-level analysis, suitable data will need to be input. This

suitable data will typically be archival radiation-test data from previously tested components similar to the original device. LaBel et al. [20] offers guidance

and commentary on how representative the data pulled from archives can be to the real data, as

well as several recommendations for this type of procedure. Ladbury [30] gives several

suggestions on how to pick the next-best data to use for the analysis, as illustrated in Figure

5-2. In the ideal best-case scenario, there will be representative flight-lot specific data. Since this

scenario is unlikely with newer COTS components, the next closest representative data should be


selected as illustrated in Figure 5-2. Once a device has been selected, we refer to the device data

that is used in the analysis as the Radiation Tested Replacement Part (RTRP).

Figure 5-2. Statistical structure of representative data [30].

There are two main goals from this stage of the data collection. The first is to obtain a

Weibull curve describing the Cross Section vs. Effective Linear Energy Transfer (LET) curves

for every component’s relevant SEEs. Figure 5-3 from Oldham et al. [72] shows example points

that will be used to generate a Weibull curve for a non-volatile memory component used in the

case study. These curves serve as inputs into a mission simulator (like CREME96 or SPENVIS)

to predict error rates for each type of SEE. In some scenarios, the actual data values may not be

provided and the reference may only provide a chart. In these scenarios, MATLAB is used to

generate a best estimation for the Weibull curve. The best Weibull model fit is calculated by

estimating key points on the chart visually and having MATLAB perform an automated least

squares regression. The second goal of this stage is to acquire a TID value for each component,


which will typically be recorded in krads. This number will help in the future stages to determine

component survivability in the mission environment.

Figure 5-3. Example cross section vs. LET graph [72].
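Where a reference provides only a plotted curve, the Weibull parameters can be recovered by least-squares regression, as described above with MATLAB. Below is a minimal Python sketch of the same fitting step, assuming a handful of (LET, cross-section) points digitized from a published chart; the data values are hypothetical.

    import numpy as np
    from scipy.optimize import curve_fit

    def weibull_xs(let, sigma_sat, onset, width, shape):
        """Standard Weibull SEE cross-section model:
        sigma(L) = sigma_sat * (1 - exp(-((L - L0)/W)^s)) for L >= L0."""
        arg = np.clip((np.asarray(let, float) - onset) / width, 0.0, None)
        return sigma_sat * (1.0 - np.exp(-arg ** shape))

    lets = np.array([2.0, 5.0, 10.0, 20.0, 40.0, 60.0])         # MeV-cm^2/mg
    xsecs = np.array([1e-9, 4e-8, 2e-7, 5e-7, 7e-7, 7.5e-7])    # cm^2/bit

    params, _ = curve_fit(weibull_xs, lets, xsecs,
                          p0=[1e-6, 1.0, 20.0, 2.0], maxfev=10000)
    print(dict(zip(["sigma_sat", "onset", "width", "shape"], params)))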

Stage 3: Mission and Model Parameter Entry

For the third stage of the methodology, the data collected in the first and second stages is

used with specific mission characteristics (such as orbit) that define the mission environment and

is entered into tools that predict SEE and TID rates for that environment. Key tools for this type of analysis include CREME96 (https://creme.isde.vanderbilt.edu/) and SPENVIS (https://www.spenvis.oma.be/), which can be used to estimate the

expected SEE and TID rates, respectively, for the components within the mission specifications. Table 5-1 provides an example of expected output results from CREME96. This table displays the specific SEU upset rates for a

non-volatile memory used in the CSP case study. Table 5-2 displays a subset of outputs for


SPENVIS, with the specific values for a year in the same low-Earth orbit (LEO) used in the case

study. Here, SPENVIS is used to calculate TID because CREME96 does not take into account

the additional fault rate from trapped protons, while CREME96 is used to calculate SEEs. A

detailed description of CREME96 functionality (including a walkthrough for configuring it) is

presented in Engel et al. [74].

Table 5-1. SEU upset rates for non-volatile memory reported by CREME96.

Type                    Rate
SEEs/bit/second         1.08E-24
SEEs/bit/day            9.30E-20
SEEs/device/second      8.61422E-15
SEEs/device/day         7.44268E-10

Table 5-2. Typical TID amounts for LEO with 1-year mission reported by SPENVIS.

Al (mils)    Total (rads)    Trapped electrons (rads)    Bremsstrahlung (rads)    Trapped protons (rads)
1.968        6.140E4         5.850E4                     1.070E2                  2.800E3
98.425       2.906E2         1.963E2                     1.778E0                  9.255E1
196.850      9.858E1         2.711E1                     8.719E-1                 7.059E1
787.400      4.146E1         0.000E0                     2.771E-1                 4.119E1

Certain components may only have results from proton testing, or only have heavy-ion

data and need results for protons (in LEO, upsets are dominated by trapped proton upsets). In this

scenario, we consult the method presented by Barak et al. [75] and Petersen [76]. These papers

explain how to use the Figure of Merit (FOM) approach to estimate the missing SEU rates based

on known data from a particular cross section. More concisely, FOM explains how to predict the

heavy-ion upset rate if the cross section for protons is known and vice versa. Once the missing

rates have been calculated, such information is also entered into the tools.

Stage 4: Fault-Tree Construction, Iteration, and Modification

The final stage is to construct the DFT from a study of the computer architecture as well

as component interactions, board schematic, and layout. The main goal is to devise a DFT that


represents the failure sequences of the system (as mentioned previously, the accuracy of this

model is dependent on the competency of the designer). As described in the second stage, there

should be a basic fault event for each SEE type applicable to the component. A basic fault

event is pictured in Figure 5-4, and is where the SEE fault rates from CREME are entered. In

Figure 5-4, the heavy-ion upset rate (HUP) and proton upset rate (PUP) are basic events for the

non-volatile memory we have been using as an example. Windchill Predictions displays

unreliability (Q) of the component at a fixed point in time, which is set to 24 hours for this study.

The fault rate must be converted from faults/upsets per day, as provided by CREME96, to faults/upsets per billion (10^9) hours, known as Failures in Time (FIT); a small conversion sketch follows the module list below. This fault tree is

constructed with the PTC Windchill Predictions (formerly Relex Reliability Prediction) software,

a recommended tool for NASA reliability calculations, for both computation and analysis. This

methodology is not limited to this specific software and can be used with any fault-tree tool as

long as the system design can be accurately reflected. Windchill Predictions is relatively easy to use and includes several DFT gates in its toolset [38]. Some key modules that could have

extended fault trees depending on the board are listed below:

• Microprocessor Failure

• Passive Component Failure (Resistors etc.)

• Programming Circuitry Failure

• Supervisory Circuit Failure

• Timing Reference Failure

• Memory Failure

• Transmitter / Receiver Failure
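As referenced above, the rate conversion and the fixed-time unreliability that Windchill Predictions reports can be sketched as follows; the exponential form Q(t) = 1 - exp(-lambda*t) is the standard constant-rate assumption and is used here for illustration only.

    import math

    HOURS_PER_DAY = 24.0

    def per_day_to_fit(upsets_per_day: float) -> float:
        """Convert a CREME96 rate (upsets/day) to Failures in Time (FIT),
        i.e., events per 1e9 device-hours."""
        return upsets_per_day / HOURS_PER_DAY * 1e9

    def unreliability(upsets_per_day: float, hours: float = 24.0) -> float:
        """Point unreliability Q at a fixed time under a constant rate:
        Q(t) = 1 - exp(-lambda * t)."""
        lam = upsets_per_day / HOURS_PER_DAY   # events per hour
        return 1.0 - math.exp(-lam * hours)

    rate = 7.44268e-10                 # upsets/device/day, from Table 5-1
    print(per_day_to_fit(rate))        # ~0.031 FIT
    print(unreliability(rate))         # Q over the 24-hour window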

In this methodology, each of these key design modules should be considered. Figure 5-5

illustrates the top-level hierarchy of the CSP case study with transfer gates to each of the

described modules, which are expanded into their own fault trees. In our case study, we have

elected to focus on the microprocessor (Zynq), memory, and power regulation modules. For


reference, parts of the case-study memory module are illustrated in figures here. The memory

module transfer gate (shown in dashed box) in Figure 5-5 is expanded in Figure 5-6. Figure 5-6

shows that a memory failure can result from volatile or non-volatile memory. Within that

memory module, the non-volatile memory transfer gate (shown in dashed box) is expanded in

Figure 5-7. Figure 5-7 is the fault tree for the NAND flash memory used in the case study. The

fault tree illustrates a particle strike causing a SEFI or SEU (as shown in Figure 5-4) and the

NAND flash failing due to usage (wear). Some parts of the fault tree are specific to the case-

study design. Calculations for an upset in the boot partition of the NAND flash are evaluated in a

different fault tree. Additionally, there is an inhibit gate entry to reflect that, in this design, a

failure of the NAND flash will not cause the board to fail unless the processor restarts. If the

processor is currently running, it would just note that the NAND flash was disabled and continue

nominal operation.

Figure 5-4. Basic event for a SEU to memory cell in non-volatile memory from heavy ions or

trapped protons.


Figure 5-5. System-level fault tree with key modules for analysis.

This fault-tree structure can have variable granularity, expanding into a more fully detailed analysis (through a more complex fault tree or Markov model) as necessary. This structure

allows designers to modify the tree if more data becomes available, or add in more intricate

fault-tolerant techniques to test the effects on the system. This constructed DFT would represent

the basic system design and is the baseline for comparison to other modifications.

Figure 5-6. Expanded memory module.

The final step is to refine the DFT based on hardware or software fault-tolerant

computing techniques selected for the system. For particularly complex processor or component

interaction, a Markov model can be constructed in its place if necessary and the PTC tool can

dynamically link the Markov model into the fault tree. This DFT expresses total board-design failure as quantifiable values that reflect the overall reliability of the system, including added

fault-tolerant capabilities to combat radiation effects. Figure 5-8 shows the same non-volatile


memory module structure in Figure 5-7, but enhanced with error-correcting code (ECC) with an

inhibit gate (shown in dashed box).

Windchill Predictions can calculate different reliability measures for the top-level gate

(processor failure) once the system fault tree has been constructed and all fault rates have been

entered as basic events. The calculator takes time and number of data points as inputs and can

calculate unreliability, failure rate, frequency, and number of failures. From these calculated

metrics other reliability measures can be derived, such as mean time to failure and upset rate per

day. Lastly, the tool can export all its results to a Microsoft Excel spreadsheet to be used in any

other analysis as desired. Figure 5-9 shows a graph generated by the Windchill Predictions tool

of board failure from the case study, in terms of unreliability vs. a 24-hour timeframe.

Reliability measures are important for building a baseline to allow comparisons of the

same board with modified parts or fault-tolerance strategies or with other space-computer

hardware and software configurations. These values allow us to specifically compare different

component configurations (all-commercial, hybrid, all-rad-hard design) to determine the amount

of reliability gained from additional fault-tolerant components, as well as the associated

monetary cost for extra reliability. This same strategy can be deployed across the same board

with different software fault-tolerance strategies through appropriate fault tree or Markov model

additions. The fault tree can only account for SEE effects and is plotted in an unreliability vs.

time graph, which will be referenced when accounting for TID.


Figure 5-7. Expanded non-volatile memory section.

TID cannot be properly reflected in the fault tree due to configuration limitations in

Windchill Predictions. After obtaining the TID information by entering mission-specific parameters into SPENVIS, the survival duration for each component is calculated. Using the

fault-tree structure to determine which component failures are survivable, the time until failure

due to TID can be calculated. In the simplest scenario, if no components can fail without causing

the entire computer to fail, the survival time due to TID is the time until failure for the

component with the lowest TID. This calculated time to fail due to TID is then assumed to be the

maximum time for the analysis so the unreliability vs. time graph for SEEs ends at this

calculated time.
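For the simplest scenario just described, the computation reduces to a ratio. A minimal sketch follows, with hypothetical TID ratings checked against the GEO dose later reported in Table 5-3.

    def tid_survival_days(tid_limits_krad, dose_krad_per_year):
        """Simplest scenario: the board dies with its weakest component, so
        survival time is the lowest TID rating divided by the dose rate.

        tid_limits_krad maps component name -> TID tolerance in krad."""
        dose_per_day = dose_krad_per_year / 365.0
        name = min(tid_limits_krad, key=tid_limits_krad.get)
        return name, tid_limits_krad[name] / dose_per_day

    # Hypothetical ratings against the GEO dose of 71.3 krad/year:
    print(tid_survival_days({"NAND": 50.0, "regulator": 20.0}, 71.3))
    # -> ('regulator', ~102 days), the same order as Table 5-4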


Figure 5-8. Non-volatile memory module with ECC.

A modified approach is required in a more complex scenario where, due to fault

tolerance, a system can survive certain component TID failures. When a component fails and is

removed from the system, this changes the fault-tree structure of the system and by extension its

reliability. To properly account for this change, a new fault tree is created with the component

removed. This change creates a discontinuity in the original graph, so the new graph will look

like a piecewise function, where the original fault tree is used up to the time at which the


component should fail, then the new fault tree is used from this point onward to reflect the

changes in the system.

While this research does not encompass all radiation-analysis techniques, it still provides the reliability engineer with a practical method to model and compare different space-

computer designs and study the tradeoffs. Eventually, we hope to expand this model to reflect

other metrics including performance and availability, and other forms of radiation analysis.

Figure 5-9. Graph generated by Windchill Predictions for case study board failure.

Mitigation Guidelines

The methodology expresses an iterative process in which the design is repeatedly analyzed and then modified. This paper does not cover different mitigation strategies or how to model them in a fault tree or Markov model; however, some suggestions are provided, with additional methods described by Foucard [77]. For failures due to TID, spot/sector shielding can be used to

provide some protection. If unacceptable fault rates are generated from SEEs, then components

can be up-selected to a higher-grade component or more system redundancy can be included.


CSPv1 Analysis

CSPv1 provides a unique example of a design that is configurable, and it also serves as a

useful case study for deploying the new methodology. The most useful feature of this board for this analysis is its selective population scheme for several components. This scheme

allows certain components of the board to have both commercial and radiation-hardened

footprints to populate the design. This approach allows the user to scale reliability and cost by

selecting different components. For the case study, an all-commercial variant of the CSPv1 is

compared with a CSPv1 that has all the available rad-hard footprints populated (hybrid CSPv1).

Case Study: Description and Assumptions

For this case study, the methodology steps were completed for the two CSP designs. DFT

models were constructed for the COTS variant and hybrid variant that included the rad-hard

components. The full DFT diagram is too large to be reasonably and coherently displayed in this

paper, but the general structure for a module has already been illustrated with Figure 5-4 to

Figure 5-8. Each component of the CSPv1 was analyzed and the fault rates by SEE type were

entered as basic events in the DFT as described by Figure 5-4. Finally, these fault rates and

relevant data were collected for analysis for both boards in two different orbits: Low-Earth Orbit

(LEO) and Geostationary-Earth Orbit (GEO).

This study assumes 98.425 mils of aluminum shielding. The representative LEO orbit for

this study is the International Space Station orbit, while the representative GEO orbit is the

AMC-18 satellite orbit. The DFT models were constructed without any additional fault tolerance

and represent the basic system. Finally, it should be noted that there was no available radiation

test data for several commercial components. In these cases, the best estimate was based on

available data and the RTRP selections as described in the methodology section.


Lastly, it should be noted that there can be a discrepancy between vendor-provided radiation data for rad-hard parts and test data for commercial components. In studying this issue, engineers discovered that a

commercial NAND flash obtained better results than reported by vendors for the radiation-

hardened counterpart. One reason for this discrepancy could be vendors reporting lower numbers

to keep within acceptable manufacturer-guaranteed ranges, which may be below the actual

capability. In these situations, the radiation-hardened variant is expected to perform better than

the reported data suggests. Therefore, for this case study, if the COTS fault rates were lower than

the radiation-hardened fault rates, then the radiation-hardened numbers used for the analysis

were increased to be at least equivalent to the COTS numbers.

Case Study: Results and Analysis

For survivability and lifetime results, mission-specific parameters were placed into

SPENVIS for both LEO and GEO environments, and the overall expected TID was generated for

a year (Table 5-3). For this design, no component can fail without causing a complete board failure; therefore, the lowest TID among the components is compared to the overall

expected TID and a simple ratio calculation gives the amount of time until the component fails.

These results are reflected in Table 5-4.

Table 5-3. Yearly TID by orbit.

Orbit Expected TID

LEO 0.29 krad/year

GEO 71.3 krad/year


Table 5-4. Estimated board lifetime.

Configuration Orbit Lifetime

CSP (Either Configuration) LEO ~10+ Years

CSP-COTS GEO ~100 Days

CSP-Hybrid GEO ~200 Days

For SEE and transient upset results, DFTs were constructed for both configurations of the

board and reliability measures were generated by Windchill Predictions. Windchill Predictions

also has the capability to calculate results for all intermediate gates within the system fault tree,

so certain modules can be explored. The most interesting module for this comparison is the

power-system module, since this module varies the most between our two case-study boards

(i.e., the hybrid CSPv1 has rad-hard power regulation components).

Several main observations can be drawn from this study, which demonstrate the

usefulness of the methodology. First, after examining the upset rates of the submodules of the

fault tree, the system upset rate is primarily dominated by common components (Zynq, DDR) in

both the COTS and hybrid variations, so both boards will have similar upset rates reported in any

orbit. Since the results are similar between both boards, Table 5-5 shows the expected upset rate

for each of the studied orbits without differentiating between configurations. This finding is

displayed in Figure 5-10, which contains the reliability curves in both orbits for the boards, as

well as the Zynq and DDR components for comparison.

While the overall system reliabilities are similar, Figure 5-11 shows the reliability of the power

modules in both GEO and LEO orbits. These results show differences between the COTS and

rad-hard components in both LEO and GEO. A comparison of the failure rates of these

components is provided in Table 5-6.


Table 5-5. CSPv1 board upset rate.

Computer Orbit Upsets/Day

CSP (Either Configuration) LEO 1.9797

CSP (Either Configuration) GEO 16.235

Figure 5-10. LEO and GEO reliability curves.

Key findings show that SEE upset rates were dominated by the same COTS components in both board configurations. The rad-hard components,

however, are still useful because they are more resilient to cumulative radiation effects, which

improves the system’s lifetime, even though they have only a minor contribution to improving

SEE upset rate.

Several significant observations emerge from employing the defined methodology. Because the Zynq and DDR components have the highest upset rates on the board, the difference in SEE upset rates between configurations is minimal. This finding exposes weaknesses in the design that can be addressed with fault-tolerant computing techniques. In this example, the Zynq can be further mitigated using well-known techniques such


as configuration scrubbing and triple-modular redundancy structures. The DDR could be further

mitigated with ECC. This analysis shows the designer which components to focus on to improve reliability. It also shows that in LEO the rad-hard parts may not be necessary and a commercial board can be deployed, thereby reducing costs. Lastly, the methodology highlights aspects of the environment of which a newer designer may be unaware, such as the much shorter lifetimes and higher upset rates found in GEO compared to the relatively benign LEO.

Table 5-6. Power system upsets/day.

Orbit   CSP-COTS    CSP-Hybrid
LEO     1.713E-03   9.0147E-06
GEO     1.4E-03     2.6104E-06

Figure 5-11. Power module reliability (reliability vs. time in hours for COTS and hybrid power modules in ISS/LEO and GEO orbits).

Methodology Insights and Improvements

This section presents a practical methodology to determine and evaluate radiation-

oriented reliability characteristics for space computers from a system-level SmallSat perspective.

This methodology can help designers in gauging the general level of reliability of their design,


comparing its reliability against other designs, deciding on component selection during the

development phase, and evaluating effectiveness of hardware and software fault-tolerance

mechanisms in the design. Our methodology is relevant, even though it has not been validated by

a multitude of radiation tests and comparisons, because it builds on established and widely

accepted methods and techniques, and it combines them to provide an initial analysis of a design.

Additionally, the soundness of this approach has been reviewed and approved by radiation experts at NASA Goddard, with the caveat that assumptions should be clearly stated and limitations expressed to prevent any unintentional misuse. In this chapter, we explored different configurations of the CSPv1 space computer and evaluated them under different

environmental conditions. This methodology has illustrated potential issues in the board design

that can be addressed with fault tolerance. Finally, this study has provided an initial first-order

estimation of both the survivability and expected upset rates of these board configurations.

The methodology established in this chapter can be further expanded to cover more advanced types of analysis and provide even more accurate predictions. CSPv1 has already been exposed to neutron-beam testing in both commercial and hybrid configurations. Preliminary impressions of the neutron-test results suggest they will confirm the predictions examined here. Further analysis will be performed when the results of those tests are finalized. Additional

topics for future study are listed below:

• Include explicit instructions and descriptions for analysis within a spacecraft using ray tracing in conjunction with University of Wisconsin-Madison's Direct Accelerated Geometry Monte Carlo Toolkit (DAGMC). This method would enable modeling of components based on their physical location within the board and within the spacecraft.

• Provide further examples with different fault-tolerant computing techniques employed

within the DFT model.

• Expand the methodology and provide example models to add performance and

availability metrics.


CHAPTER 6 CSPv1 DESIGN

The reliability methodology was specifically created to help analyze and design the

CSPv1 board. CSPv1 is the first flight board to evolve from the CSP concept and features a

hybrid-processor and hybrid-system architecture. The processor architecture features fixed (dual

ARM Cortex-A9/NEON cores) and reconfigurable (28 nm Artix-7 FPGA fabric) logic on the

Xilinx Zynq-7020 SoC device. The system architecture combines commercial and rad-hard

electronics with an assortment of techniques in fault-tolerant computing to achieve a system with

a powerful combination of high speed and reliability with low power, size, weight, and cost

(SWaP-C).

Hardware Architecture

Some specifications of the Xilinx Zynq-7020 device used by CSPv1 are provided in

Table 6-1 and Table 6-2; a block-level diagram is provided in Figure 6-1. Attached to the ARM

side of this device, the CSPv1 can support up to 1 GB of DDR3 memory (maximum capacity

supported by the DDR3 controller).

The CSPv1 is designed to fit a 1U standard CubeSat form factor (10 cm × 10 cm). All

external connections to the CSPv1 board are made through a 160-pin Samtec Searay connector.

There are 60 connections from the FPGA side of the Zynq, where 48 pins can be configured as

24 differential pairs for high-speed interfaces. There are also 26 high-speed connections from the

ARM side of the Zynq that can be configured in a combination of varying communication

interfaces including UART, I2C, and SPI.


Table 6-1. Xilinx Zynq-7020 ARM specifications1.

ARM Specifications

L1 Cache Per Core 32 KB Instruction / 32 KB Data

L2 Cache Shared 512 KB

On-Chip Memory 256 KB

Clock Frequency 667 MHz (-1 Speed Grade)

Table 6-2. Xilinx Zynq-7020 FPGA specifications.

FPGA Specifications

Programmable Logic Cells 85,000

Look-Up Tables 53,200

Flip-Flops 106,400

Block RAM / # 36 Kb Blocks 4.9 Mb / 140

DSP Slices 220 (18 x 25 MACCs)

Figure 6-1. CSPv1 Rev B. block diagram.

1 https://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf


The CSPv1 Rev. B has a unique selective-population scheme in which several board components have both commercial and rad-hard PCB footprints in the design. This approach enables the user to scale reliability and cost by selecting

different components for mission needs. Figure 6-2a shows a populated board using all

commercial components (CSPv1 Rev. B Engineering Model or CSPv1 Rev. B EM), where the

unpopulated regions are for the placement of equivalent rad-hard components. Figure 6-2b shows

the board populated with rad-hard components in the compatible areas; this board is the CSPv1 Rev. B flight board. Additionally, the scheme has the ancillary benefit of allowing mission designers the

option to purchase or configure a low-cost, easy-to-develop, all-commercial testbed that is highly

reflective of flight hardware. Configurable subsystems within CSPv1 Rev. B that include

selective options are the non-volatile NAND Flash memory (commercial Spansion 8Gb or

radiation-tolerant 3D-Plus 32Gb), switching regulators (commercial Texas Instruments or rad-

hard e2v Peregrine), linear regulators (commercial Texas Instruments or rad-hard Cobham),

supervisory circuit (Intersil commercial and rad-hard variants), power sequencing (Texas

Instruments commercial and rad-hard variants), and finally reset management (Texas

Instruments commercial and rad-hard variants).


Figure 6-2. CSPv1 designs. A) CSP Rev. B EM. B) CSP Rev. B Flight. C) CSP Rev. C.


CSPv1 requires two input voltage rails (3.3V and 5.0V) to the board, and generates the

remaining necessary voltages with internal regulation. The entire CSPv1 Rev. B board has been

analyzed for total power consumption in a variety of different configurations to reflect use cases

that may be expected in several mission scenarios, as listed in Table 6-3. To measure power

consumption, the CSPv1 was interfaced with a test board that was used solely to provide power

inputs. The measurements shown in Table 6-3 were taken from an external power supply

connected to the CSPv1 through this test board.

Table 6-3. CSPv1 Rev. B power consumption.

Operating Frequency (MHz)      Test Load
ARM    DDR    FPGA             ARM     FPGA     Power (W)
200    200    100              Low     Low      1.54
200    200    100              High    Low      1.78
667    533    100              High    Low      2.23
667    533    100              High    High     2.86

For ground testing, the CSPv1 connects to an evaluation board with convenient interfaces

for rapid desktop prototyping of flight designs. The evaluation board exposes internal signals on

the CSPv1 connector from both the ARM and FPGA side of the device. For FPGA signals, the

evaluation board provides connectors for Camera Link, SpaceWire, and several spare single-

ended and differential signals. For ARM signals, the evaluation board provides Ethernet and

USB-Host capabilities. A secondary purpose of the evaluation board is to serve as a reference

design for the integration of various interfaces and devices. Two revisions of the evaluation

board exist and are pictured in Figure 6-3. The Rev. B added quality-of-life changes including a

3rd SpaceWire connector, a 1V8 regulator for the Ethernet PHY (in place of using the CSPv1


regulation), debug LEDs, and additional mounting holes. The datasheet for the CSPv1 can be

downloaded from the Space Micro product page2.


Figure 6-3. CSPv1 Rev. B mated to Evaluation Boards. A) Rev. A Evaluation Board. B) Rev. B

Evaluation Board.

Software Design

CSPv1 is equipped with an extensive and thoroughly tested software package. This

package includes support for two operating systems (Linux and Real-Time Executive for

Multiprocessor Systems or RTEMS), a variety of applications and drivers developed by the

research center, and platform-support packages for both operating systems in Core Flight

Executive (cFE3), NASA Goddard’s open-source, reusable, flight-software framework for local-

device management, event generation, software messaging, and support libraries.

Our research center has developed its own lightweight Linux environment, named

Wumbo, built using Buildroot. Xilinx's linux-xlnx fork of Linux is used as the kernel, and

2 http://www.spacemicro.com/assets/datasheets/digital/slices/CSP.pdf

3 http://coreflightsystem.org/


BusyBox is used for most of its user-space tools. For missions with real-time constraints, the

CSPv1 supports the open-source RTEMS with support comparable to that for Linux. Finally, it should be noted that VxWorks does run on CSPv1; however, a full CSPv1 development system

is not supported due to the overhead of creating custom drivers to provide the expected

functional feature set. To perform onboard processing, the research center has developed serial

as well as OpenMP-, NEON-, and OpenMPI-accelerated versions of commonly used image-

processing applications including image compression and basic filtering.

Fault-Tolerant Computing Options

The CSPv1 architecture was designed to support the selective population scheme to

enable commercial components to be replaced with rad-hard variants in board assembly.

Additionally, the Zynq has three internal watchdogs which can be used to detect and correct

system faults. An external supervisor circuit with hardware watchdog was integrated into the

CSPv1 to monitor the processing device for radiation upsets and to reset it if the processor cannot mitigate a fault internally.

The ARM side of the Zynq is connected to the non-volatile memory and is responsible

for configuring the system, including the FPGA, on boot. As a precaution, for the critical booting

process, CSPv1 repurposes the built-in RSA authentication features of the Zynq to check boot

images before startup. As an additional safety measure, multiple boot images can be stored in a

read-only partition of the non-volatile memory to be used as a fallback should any images

become corrupted by radiation during the mission. On boot, the Zynq BootROM will

continue to search for images to load until it finds a valid (uncorrupted) one.

Once the boot image is verified and the device is booted, the CSPv1 runs Wumbo.

Optional improvements and modifications to the system can be made to increase fault tolerance

including: disabling the caches, enabling error-correcting codes (ECC) on the DDR3 memory,


and reporting parity faults on the caches if enabled. Fault detection within the kernel is also

improved with the addition of rebooting on kernel panics, soft- and hard-lockup detection, and

the Error Detection and Correction (EDAC) module. Together, these improvements achieve

higher reliability, longer average system up-time, and more detailed system reports on upset

events.
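As a minimal sketch of the kernel-level portion of these improvements (the values below are illustrative assumptions, not the flight configuration), the panic-reboot and lockup-escalation behavior corresponds to standard Linux sysctl settings:

```
# Illustrative settings; values are assumptions, not the flight configuration.
kernel.panic = 10              # reboot 10 s after a kernel panic
kernel.softlockup_panic = 1    # escalate soft lockups to a panic (then reboot)
kernel.hardlockup_panic = 1    # escalate hard lockups to a panic (then reboot)
```

With settings of this kind in place, the EDAC module supplies the memory- and cache-error reporting described above.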

One of the main challenges for incorporating an FPGA device in a spacecraft system

stems from the SRAM-based memory architecture, which is susceptible to SEEs. SEEs can

manifest as bit flips in FPGA configuration or data memory, which can eventually lead to device

failure. One solution to these issues is configuration scrubbing: the process of quickly repairing

these configuration-bit upsets in the FPGA before they accumulate and lead to a failure. CSPv1

features a readback scrubber which periodically reads back the entire configuration memory and

performs writes to configuration frames that correct configuration memory without disturbing

other dynamic portions of memory. Additionally, CSPv1 has access to a more efficient hybrid

scrubber, which reduces overhead significantly and improves error-correction latency over the

readback scrubber, and takes advantage of both built-in, single-bit correction and ECC.

A fault-tolerant framework was developed for hybrid CPU and FPGA architectures in [73] that can be applied to CSPv1; this framework is further described in Chapter 10. This

framework takes advantage of the Zynq’s architecture to provide several different fault-tolerant

modes (e.g. duplex with compare, triple-modular redundancy) by leveraging both the ARM cores

and FPGA fabric.


Design Revisions

As with any development system, the CSPv1 has undergone several revisions to make

improvements. The changes from Rev. A to Rev. B were minor, including: (1) tweaking passive

component values; (2) revising footprints for better connections; (3) adding additional mounting

holes for mechanical stability; (4) changing FPGA connections to the NAND memory chip

enable for bank switching; and finally (5) replacing a Zener diode with a voltage divider due to

radiation survivability concerns. The CSPv1 Rev. B was designed for Low-Earth orbit (LEO)

and similar orbital conditions. The CSPv1s currently in orbit have not experienced any critical

failures. However, interested sponsors advocated the use of CSPv1 in more challenging environments; therefore, major design changes were applied from Rev. B to Rev. C in support of

deep space and Lunar missions that require higher reliability. During the heavy-ion radiation test

of the Flight Rev. B design, testers discovered that under certain conditions, heavy radiation

exposure caused part of the power system to malfunction; specifically, the board would lose the

“power good” status and crash. After further component analysis, the CSP team predicted that

the most likely culprit of these failures was the commercial Texas Instruments (TI) DDR

regulator (TPS51116PWP). Since the original design, TI developed new rad-hard components

that could be used to replace the commercial regulator. The CSPv1 Rev. C replaces the

commercial regulator with a pair of radiation-hardened regulators (TPS7H3301-SP, TPS50601-

SP) to provide the same functionality. It was later confirmed by a radiation test of the Rev. C that

this issue had been resolved. Unfortunately, due to the extensive nature of these changes, the

CSPv1 Rev. C, pictured in Figure 6-2c, does not support the selective population scheme

featured in the Rev. B.


CHAPTER 7 PERFORMANCE ANALYSIS OF CSPv1

This section studies the performance of the Xilinx Zynq featured on CSPv1, to provide an

example of the general capability to be expected, and to emphasize the benefit of commercial

processors over state-of-the-art rad-hard processors. Table 7-1 provides the maximum theoretical

throughput (gathered from vendor datasheets) for the devices with the concept of device metrics

described by Lovelly et al. [61]. As Lovelly describes, Computational Density (CD), measured in

GigaOps/second (GOPS), is a metric to describe the steady-state performance of a processor’s

computation for a stream of independent operations. These numbers represent the stand-alone

processor architecture, and do not reflect interactions with on-chip memories (caches etc…),

external memories (DDR memory), or off-chip resources. Lovelly, provides separate metrics to

measure these interactions called Internal Memory Bandwidth (IMB), External Memory

Bandwidth (EMB), and Input/Output Bandwidth (IOB) respectively. These calculations are

based upon a 50-50 mix of addition and multiplication operations, that is representative of

common and critical operations in many computational kernels of space applications. Table 7-1’s

columns for CD illustrate the disparity in performance between established rad-hard devices

(HXRHPPC, RAD750, GR712RC, GR740) and the commercial device (Zynq). Additionally,

these results can also be scaled with respect to the devices’ power consumption, as shown in

Table 7-1’s columns for CD per Watt (CD/W), further highlighting the dramatic efficiency gains

of the Zynq over rad-hard devices.
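As a worked sketch of the CD metric (assuming, for illustration, that CD reduces to achievable operations per cycle times clock frequency; the one-operation-per-core-per-cycle figure and the ~1.5 W power implied by the table's CD/W entries are assumptions, not values taken from [61]), the quad-core, 250 MHz Cobham GR740 row can be reproduced as:

```latex
\mathrm{CD} = N_{\text{cores}} \times \frac{\text{ops}}{\text{cycle}} \times f_{\text{clk}}
            = 4 \times 1 \times 0.25\,\mathrm{GHz} = 1.0\,\mathrm{GOPS},
\qquad
\mathrm{CD/W} \approx \frac{1.0\,\mathrm{GOPS}}{1.5\,\mathrm{W}} \approx 0.67\,\mathrm{GOPS/W},
```

both matching the GR740 entries in Table 7-1.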

To provide an alternate view of processor performance, we benchmark the available Zynq processor options with CoreMark, a benchmark developed by the Embedded Microprocessor Benchmark Consortium. CoreMark contains list-processing, matrix-manipulation, state-machine, and CRC (cyclic redundancy check) calculations, which are


frequently used in flight avionics operations. Table 7-2 displays the results of the benchmarks for

a single ARM core with varying cache configurations. These results were run on the Digilent

Zedboard, a commercial evaluation board with a near-identical Zynq part to that of CSPv1. The L2 cache does not appear to improve CoreMark performance significantly; however, this result is explained by the benchmark's low memory usage relative to the 32 KB L1 instruction and data caches. The benchmarks in Table 7-2 were compiled with the "PERFORMANCE_RUN" configuration and -O2 compiler optimizations. The number of iterations tested varied: 1,000 for the MicroBlaze and 100,000 for the ARM. This discrepancy results from the time required for the MicroBlaze to execute 100,000 iterations; the timing-precision error is minimal over 10 seconds of execution. The MicroBlaze was configured with the PERFORMANCE optimization setting (5-stage pipeline), with integer multiplication/division and the FPU enabled, and with caches disabled.

Table 7-1. Computational density and computational density per Watt of popular rad-hard processors and the Zynq.

Processor            CD (GOPS)                           CD/W (GOPS/W)
                     Int8    Int16   Int32   SPFP   DPFP   Int8    Int16   Int32   SPFP   DPFP
Honeywell HXRHPPC    0.08    0.08    0.08    0.08   0.04   0.01    0.01    0.01    0.01   0.01
BAE Systems RAD750   0.27    0.27    0.27    0.13   0.13   0.05    0.05    0.05    0.03   0.03
Cobham GR712RC       0.08    0.08    0.08    0.03   0.03   0.05    0.05    0.05    0.02   0.02
Cobham GR740         1.00    1.00    1.00    1.00   1.00   0.67    0.67    0.67    0.67   0.67
Xilinx Zynq-7020     283.3   152     46.29   40.57  12.63  60.41   36.91   10.72   7.83   3.06

This analysis with caches is relevant for reliability considerations. The caches occupy a

significant portion of area on any processing device and therefore have a higher probability of

experiencing a radiation-induced upset. During neutron testing of the Zynq for CSPv1 as

described in [78], it was noted that the Zynq experienced much worse reliability with caches

enabled. While these L1 and L2 caches on the Zynq are reported to have parity-bit checking


enabled, the CSP team was unable to develop an automated and easy-to-use process for cache

parity-triggered recovery. Although the system receives an interrupt when a cache error occurs,

these errors cause the processor to hang, making it difficult to recover. Unfortunately, the Zynq

Technical Reference Manual notes the L1 “D-Cache only supports write-back/write-allocate

policy. Write-through and write-back/no-write-allocate policies are not implemented,” which rules out switching to a write-through policy as a reliability solution for cache errors. Future

experiments are planned on STP-H5/CSP to compare the upset rate of the system with and

without caches enabled. Due to this discovery, it is recommended that space-based Zynq systems

do not enable caches if reliability is the most critical factor.

Table 7-2. CoreMark benchmarking.

Configuration Iterations/sec

Single-Core ARM with Caches Enabled 1980.2979

Single-Core ARM w/o L2 Cache 1971.2254

Single-Core ARM w/o Caches 116.9640

FPGA soft-core MicroBlaze 9.5975
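As a minimal sketch of the no-cache recommendation above for a bare-metal Zynq application (flight software would apply the equivalent kernel configuration instead), the Xilinx standalone BSP cache API can disable both levels:

```c
/* Minimal sketch: trade performance for a smaller SEU-sensitive
 * cross-section by disabling the caches, per the recommendation above.
 * Uses the Xilinx standalone BSP cache API (xil_cache.h). */
#include "xil_cache.h"

void configure_for_reliability(void)
{
    Xil_DCacheDisable();   /* flush, then disable the data caches        */
    Xil_ICacheDisable();   /* invalidate, then disable instruction cache */
}
```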


CHAPTER 8 RELIABILITY ANALYSIS OF CSPv1

One of the most crucial requirements for space-system designs is the board reliability;

specifically, its capability to withstand a wide temperature range, the mechanical hardships of

launch, the vacuum of space, and the harsh radiation environment. To prepare for the first CSPv1

missions, the flight and commercial boards were extensively tested in several experiments, and a

reliability methodology was developed to help predict radiation effects in the specific

environments targeted.

Radiation Testing Results

The CSPv1 has been radiation tested in several radiation-beam experiments. This section

describes the outcomes and lessons learned at each test.

Neutron Testing

High-energy neutron testing provides an estimation of system reliability in radiation-rich

environments. The CSPv1 flight board was tested under a narrow beam for several days at the

Los Alamos Neutron Science Center (LANSCE) in December 2014 (shown in Figure 8-1a). The

recorded logs revealed that the rad-hard watchdog timer rebooted the board and that the EDAC Linux

kernel module reported ECC errors in the DRAM and parity errors in the L2 cache as expected.

Hundreds of errors reported by the Linux kernel were logged over the serial terminal for

analysis. Study of those logs indicated that about 75% of the reboots originated in L2-cache

events, and it is suspected that a majority of the remaining events were caused by the L1 cache, whose errors were not reported at that time. Additionally, this experiment helped stress-test the

hardware watchdog-timer circuitry.

Another neutron-beam test was performed in May 2015 at the TRIUMF facility in

Vancouver (Figure 8-1b) on both the commercial CSPv1 board and several Zynq-based development boards to test the cross-sections of the caches and on-chip memory. Analysis of

the logs showed that the FPGA configuration-memory readback scrubber reported many single-

and multi-bit upsets on the commercial CSPv1 board. The cache and on-chip memory cross-

section tests showed that the no-caches configuration provides a viable option for improved

reliability at the cost of performance.

Brookhaven National Laboratory October 2015 Radiation Test

A heavy-ion test was performed on the CSPv1 Rev. B at the Brookhaven National

Laboratory (BNL) NASA Space Radiation Laboratory (NSRL) facility from October 23 through October 25, 2015. The test used the "Variable Depth Bragg Peak" (VDBP) method of

SEU testing [79] to enable the team to perform a heavy-ion "system-level" test that irradiates all

parts of the CSPv1 board without delidding or thinning, which is frequently required at other test

facilities. There were two primary goals for this experiment. The first goal was to determine the

survivability of the CSPv1 board by sweeping the board through a wide range of Linear Energy

Transfer (LET) values. This test was completed to determine if any permanent catastrophic

failures could occur on the system. The second goal was to better understand the single-event

upset (SEU) response of the system at a wide range of LET values. This experimental setup is

displayed in Figure 8-1c.

This test was successful and allowed the CSP team to study the CSPv1 Rev. B flight

board’s failure modes due to radiation effects. The first observation was that the Linux system

suffered from many SEU-related kernel reboots, which were logged throughout the experiment.

This experiment confirmed the presence of the Xilinx 7-series “high-current” event (micro-

latchup) at higher LET values, however, the event is not destructive and can be resolved with a

power cycle. Finally, the most critical observation is that under certain scenarios, a new failure

mode was detected where the “power good” status of the regulators would fail. This new failure


mode was not a permanent failure and was recovered on a power cycle, however, it showed the

team that more analysis needed to be conducted. Overall, the boards tested were fully functional

(following a power cycle) after testing had concluded, which indicated there were no components subject to catastrophic latchup that could prematurely end mission operations.


Figure 8-1. CSP at test facilities. A) CSPv1 at LANSCE. B) CSPv1 at TRIUMF. C) CSPv1 at

BNL NSRL.

Brookhaven National Laboratory October 2016 Radiation Test

The first BNL radiation test proved that the CSPv1 Rev. B flight boards are highly likely

to meet expectations in the targeted LEO and LEO-like environments. However, several of the

research center’s partners desire to use the CSPv1 in more challenging radiation environments.

To make the board more reliable for harsher environments, and remove the “power good” failure


mode detected in the previous test, the CSP team developed the CSPv1 Rev. C. Fortuitously,

sponsored by NASA Johnson Space Center’s EV511 group, the CSP team returned to BNL for a

second heavy-ion test.

The CSPv1 Rev. C successfully passed a survivability test without suffering any

permanent damage. The device returned to a fully functional state after the manual power cycle performed between runs. Additional experiments were conducted to: characterize micro-latchups that

have been previously observed on the Zynq; study the behavior of the ancillary reset circuitry;

and finally, profile the reliability of the NAND Flash memory. The team confirmed the new

design did not suffer from the “power good” error state experienced in the previous test.

However, a new error-mode condition was detected which occurred solely under high flux. At

high flux rates, a current drop-out mode occurs on the 5V0 supply rail, causing the Zynq to hang and

preventing the system from rebooting. Fortunately, none of the errors observed during testing

pose a threat to flight, as the conditions needed to trigger them required much higher flux rates

than would be encountered in projected mission orbits.

Radiation Environment Upset Prediction

To estimate the reliability of CSPv1 in different environments, we developed a new

methodology for estimating reliability of space computers from the system-level perspective.

The method, fully described in Chapter 5, can then be used to build a first-order estimate of the

reliability of the system given specific mission-environment conditions. These measures can be

used to assist in making component or device selections by comparing the reliability of the same

design with certain components replaced, comparing the reliability of different space computer

options, and comparing hardware and software fault tolerance within the board design.

One frequent challenge in estimating the reliability of commercial components is that many commercial parts have not been through any degree of qualification or radiation-reliability screening, and even more rarely has their behavior been confirmed in radiation-beam testing. The developed methodology consists of four key stages that can be used to predict the upset rate and failure of a design in various orbits, as well as help make

effective component selections in the board design phase. This methodology was applied to

CSPv1, such that all components (especially commercial components) were analyzed to build a

system model to generate upset rates the CSPv1 may experience in different orbits, relying on

state-of-the-art tools such as CRÈME961, SPENVIS2, and PTC Windchill Quality Solutions3.

Workmanship Reliability

The CSP flight box on STP-H5 underwent environmental and workmanship testing with

STP-H5/ISEM and again with the full STP-H5 pallet. ISEM was required to undergo a

workmanship-level, random-vibration test and a thermal cycle test. The random-vibration test is

performed to identify latent defects and manufacturing flaws in electrical, electronic, and

electromechanical hardware at the component level. The thermal cycle test is performed to

confirm expected performance of a device in a temperature range enveloping mission conditions.

The random-vibration test was performed unpowered, with a sine sweep prior to and after the run in each principal axis (X, Y, Z). The results of the sine sweeps are compared before and after the

random-vibration test to verify there were no changes in frequencies. Any major changes would

indicate an alteration in the structure and would need to be investigated. The workmanship

vibration test of the ISEM assembly was performed successfully on all three axes, with no

significant changes detected during the sine sweeps.

1 https://creme.isde.vanderbilt.edu/

2 https://www.spenvis.oma.be

3 http://www.ptc.com/product-lifecycle-management/windchill/quality


The ISEM assembly also underwent a full thermal-vacuum (TVAC) test. A temperature

profile range is selected based on the limits of the components involved and the expected

temperatures on orbit, to expose the assembly to the maximum operational flexibility expected.

The general profile consisted of two cycles in vacuum with a hot operational plateau of 50°C and

a cold operational plateau of -10°C, at the ISEM baseplate interface. A full-functional

performance test was performed at each plateau, with nominal on-orbit activities occurring

during the temperature transitions. The test was performed using minimum and maximum input

voltage at various stages to capture corner cases, as the specified input voltage could be subtly

different based on power converter performance and signal integrity. The STP-H5/CSP

performed nominally throughout the TVAC test, which indicates readiness for mission exposure.

Further details for these experiments can be reviewed in Wilson et al. [78].


CHAPTER 9 HIGHLIGHTS OF STP-H5/CSP MISSION EXPERIMENT

This chapter describes the mission-specific configuration of the CSPv1 on STP-H5

(Figure 9-1). STP-H5 was launched on February 19, 2017, docked with the ISS on February 23, 2017, and was placed on the ExPRESS Logistics Carrier-1 (ELC-1). Now installed, STP-H5/CSP

will be a continuous-development platform for software testing, because new applications,

design cores, and upgrades can be uploaded and tested on board.

Figure 9-1. STP-H51 Pallet 3D-view and integrated-for-flight system

Figure 9-2. STP-H5/CSP flight unit.

1 Photo courtesy of the DoD Space Test Program


Mission Configuration

This section describes the main components that constitute STP-H5. This description

includes an overview of the hardware, software, and ground-station operation.

Hardware

The STP-H5/CSP flight box (Figure 9-2) can fit four boards in a 1U form-factor: two

hybrid-flight CSPv1 boards (CSP0 and CSP1); one power/interface board; and one custom

backplane interconnect board. The two CSPv1 boards are set up in a master-slave configuration

where CSP0 receives all ground commands and forwards requests to CSP1 as necessary. CSP0

contains a SpaceWire FPGA core to provide a communication interface to the SpaceCube Mini

and ISS. The backplane board is the central-interconnect interface connecting all the boards

together, directly routing traces between the main connectors. Two SpaceWire and UART

interfaces can be used to pass data between CSP0 and CSP1. The two-board configuration

enables configuration changes to be first tested on CSP1 before any reconfiguration to CSP0,

which is the main interface to the rest of the experiment. The power/interface board consists of

mostly radiation-hardened components; it routes and regulates power to the entire flight unit and provides the main communication interfaces. Four external connectors are provided

on the CSP flight box: Camera Link; SpaceWire; power in; and debug I/O. External to the CSP

flight box, a Sony 5-megapixel color camera is interfaced using a Camera Link FPGA processing

pipeline, powered by the FPS experiment.

Software

The CSPv1 flight boards on STP-H5/CSP are configured to boot from the onboard

NAND flash. The Zynq's RSA fallback feature is used to achieve reliable booting with several

“golden” fallback images stored in a read-only partition in flash memory. The next partition

contains space to store additional boot images that are uploaded post-launch. In each boot image,


there is a First-Stage Boot Loader (FSBL), Second-Stage Boot Loader (U-Boot), FPGA

bitstream, and Wumbo Linux image. The Linux image uses an initramfs (initial RAM file

system) as its root filesystem and mounts a non-volatile JFFS2 partition after boot.
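As a minimal sketch of how such a boot image is assembled for a Zynq target (the file names are hypothetical stand-ins, not the mission's actual artifacts), a Xilinx bootgen .bif description bundles the four components listed above:

```
/* Illustrative bootgen .bif; all file names are hypothetical. */
the_ROM_image:
{
    [bootloader] fsbl.elf        /* First-Stage Boot Loader */
    system.bit                   /* FPGA bitstream */
    u-boot.elf                   /* Second-Stage Boot Loader */
    [load = 0x2000000] wumbo.ub  /* Wumbo Linux image with initramfs */
}
```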

Contained in the Wumbo image are several key cFS applications. Significant cFS flight-

system applications include the Scheduler (SCH), Health Services (HS), File Manager (FM), and

Stored Commands (SC). SCH is used mostly to schedule telemetry requests to applications. HS

is used primarily to handle watchdog interaction. FM is used to manipulate local files. Finally,

SC is used to execute command sequences, such as an image capture at an absolute or relative

time.

Custom cFE applications that were developed for the STP-H5/CSP mission include:

Command Ingest (CI); Telemetry Output (TO); File Transfer (FT); File Transfer Delivery

Protocol (FTDP); FTDP Receive (FTDPRECV); FTDP Send (FTDPSEND); Image Processing

(IP); Camera Control (CCTL); Self-Timer (SELF_TIMER); CSP Health (CSPH); and Scrubber

(SCR). A custom communication library supplies a frontend for CI and TO to the

communication interface. Depending on compilation options, the backend can be either

SpaceWire or POSIX sockets, and is designed to be transparent to applications. CCTL is used to

interact with the camera, and communicates with SELF_TIMER to capture images at specified

intervals. FTDP and FT are used for file upload and download, respectively. File uploads are

performed over the Communications Interface Board (CIB) which acts as the interface between

the ISS and all the experiments on STP-H5, and downloads are streamed in High-Rate Telemetry

(HRT). IP creates thumbnails of captured images, which are streamed to the ground in

JPEG2000 format. CSPH streams health data such as device temperature, uptime, and memory


and CPU utilization, from each of the two flight boards. Lastly, SCR reports messages from the

readback scrubber, which has configurable parameters for scrub rate and detailed error messages.
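As a minimal sketch of the structure these applications share (written against the cFE 6.x API of the mission era; the EXAMPLE_* names, pipe depth, and message ID are hypothetical, not flight identifiers):

```c
/* Hypothetical cFE application skeleton; EXAMPLE_* names are
 * illustrative stand-ins, not mission identifiers. */
#include "cfe.h"

#define EXAMPLE_PIPE_DEPTH 16
#define EXAMPLE_CMD_MID    0x1882   /* hypothetical message ID */

void EXAMPLE_AppMain(void)
{
    CFE_SB_PipeId_t Pipe;
    CFE_SB_MsgPtr_t Msg;
    uint32 RunStatus = CFE_ES_APP_RUN;

    CFE_ES_RegisterApp();
    CFE_SB_CreatePipe(&Pipe, EXAMPLE_PIPE_DEPTH, "EXAMPLE_PIPE");
    CFE_SB_Subscribe(EXAMPLE_CMD_MID, Pipe);

    while (CFE_ES_RunLoop(&RunStatus))
    {
        /* Block until a ground command or scheduler wakeup arrives. */
        if (CFE_SB_RcvMsg(&Msg, Pipe, CFE_SB_PEND_FOREVER) == CFE_SUCCESS)
        {
            /* Dispatch on CFE_SB_GetMsgId(Msg). */
        }
    }
    CFE_ES_ExitApp(RunStatus);
}
```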

Ground Station

To monitor the progress of the mission and perform all primary and secondary objectives,

a ground station is set up with commanding software. The ground station deploys the Telescience

Resource Kit (TReK2) to receive and monitor packets sent from STP-H5/CSP. Packets are

received and sent through a graphical user interface (GUI) built to interact with the TReK

software. This GUI was developed with the open source Interoperable Remote Component

(IRC3) application framework with an example configuration for this mission provided by NASA

Goddard. The application framework uses XML descriptions that can be modified to easily

parse, interpret, and display incoming data, as well as send commands. IRC can be used to save

and store commands through the GUI interface. The GUI also enables the operator to select and

send commands. A Python extension was developed to interface with TReK using Python

scripts. One key Python script is the image viewer, which downloads and displays the thumbnail

images streamed from the STP-H5/CSP flight box.

Primary Mission Objectives

STP-H5/CSP has several primary requirements to fulfill in order to declare mission

success. The first objective of STP-H5/CSP is to advance the Technology Readiness Level

(TRL) of the Xilinx Zynq SoC in Low-Earth Orbit. This device is crucial for study in the

development of a new generation of space computers. It is also one of many devices that are

2 https://www.nasa.gov/sites/default/files/atoms/files/g-28367c_trek.pdf

3 https://opensource.gsfc.nasa.gov/projects/IRC/index.php


being considered for the next generation of the SpaceCube family of reconfigurable computers

developed by NASA Goddard’s Science Data Processing Branch.

Another key directive for the mission is to closely monitor and record the upset rates of

both the processing system and programmable logic of the Zynq to provide environmental

information in preparation for future missions. The main upset rates to be examined are those of the ARM cores and of the L1 and L2 caches.

The final primary requirement is to perform image processing, including noise reduction

and image enhancement, on terrestrial-scene data products. Image processing will be

demonstrated with hardware acceleration in the FPGA fabric and compared with processing on

the ARM cores with NEON acceleration. These high-resolution images can then be compressed

using JPEG2000 or converted to PPM for downlink as thumbnails or complete images and

displayed on the ground-station image viewer.

Secondary Mission Objectives

As a technology mission and experiment, STP-H5/CSP has the freedom to explore

additional research-oriented tasks as well as the ability to upload new applications and software,

when not performing primary mission tasks. There are several secondary objectives that will be

explored throughout the duration of the mission including autonomous operations, partial

reconfiguration, space middleware, device virtualization, and dynamic synthesis.

Autonomous Operations

The IP app provides access to our image-processing suite, which includes several

algorithms to perform a variety of functions. For future space-processing missions, it may

become necessary for processing tasks to be completed autonomously. Basic exploratory

functions have been added to CSPv1 to begin testing this domain of applications. The IP app has

a set of algorithms for classifying images. These algorithms can allow CSP0 to autonomously


make decisions about which images to keep, without user intervention. In a restricted downlink

scenario, this app can determine if an image taken is unnecessary (e.g., an all-white image from

cloud cover, or all-blue from just the sea), and can delete the image, saving storage capacity as

well as preventing this picture from wasting downlink bandwidth.
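As a minimal sketch of this keep/discard screening (the thresholds, RGB8 pixel layout, and function name are illustrative assumptions, not the flight algorithm):

```c
/* Illustrative keep/discard screen; thresholds and pixel layout are
 * assumptions, not the flight algorithm. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the image looks like uniform cloud (near-white) or
 * open ocean (blue-dominant) and can be discarded before downlink. */
bool image_is_discardable(const uint8_t *rgb, size_t n_pixels)
{
    if (n_pixels == 0)
        return false;

    uint64_t sum_r = 0, sum_g = 0, sum_b = 0;
    for (size_t i = 0; i < n_pixels; i++) {
        sum_r += rgb[3 * i + 0];
        sum_g += rgb[3 * i + 1];
        sum_b += rgb[3 * i + 2];
    }
    uint64_t mean_r = sum_r / n_pixels;
    uint64_t mean_g = sum_g / n_pixels;
    uint64_t mean_b = sum_b / n_pixels;

    bool all_white = mean_r > 230 && mean_g > 230 && mean_b > 230;
    bool all_blue  = mean_b > 120 && mean_b > mean_r + 40 &&
                     mean_b > mean_g + 30;
    return all_white || all_blue;
}
```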

In-Situ Upload Capability

The CSP flight box has additional software features, which include software and

firmware uploads. Flight software updates will primarily be made by uploading new cFE table

and configuration files. cFE tables can be used to change the behavior of applications, or even to

load new applications. As an example, an SC table can be uploaded that includes commands for

cFE to start an uploaded cFE application, or stop an old version and load a new one from flash

memory. For more drastic changes, such as a Linux kernel update, new boot images can be

uploaded and stored in the partition region after the golden images, as described previously.

The new environment will contain instructions for U-Boot on booting the new image. If the U-

Boot environment ever becomes corrupt, U-Boot will default to booting the golden image.

Lastly, additional functionality on this mission includes file transfer between CSP0 and CSP1.

The FTDPRECV and FTDPSEND apps can allow the transfer of large files or configurations

between the two flight boards.

Partial Reconfiguration

The CSPv1 will be one of the first deployed space computers to include Partial

Reconfiguration (PR) functionality. PR, as described in Chapter 2, is the process of changing a

specialized section of reconfigurable hardware during operational runtime. The CSPv1 allows

multiple applications to be performed in the FPGA fabric without reconfiguring the entire

device. PR can be used in space missions to reduce the total-area utilization of the fabric by

switching out designs to reduce the vulnerable configuration area, employing fault-tolerant


reconfigurable structures, and allowing new algorithms and applications to be uploaded after

completion of the primary mission. PR can improve the performance of a device by allowing the

user to include a suite of application designs to fit within a PR region, enabling a larger number

of applications to be accelerated by hardware, rather than limited by a single static FPGA design.

The CSPv1’s corrective scrubbing and error logging are also available to PR design regions.

Space Middleware

The CSP explores new fault-tolerant approaches beyond pure hardware radiation-

tolerance by extending its fault-mitigation considerations to flight software. In contrast to FPGA

mitigation techniques discussed in previous sections, this experimental research takes a

processor-centric perspective to assist in developing resilient applications on the processing system, in the form of the Adaptive Dependable Distributed Aerospace Middleware (ADDAM). The

ADDAM research is motivated by a pursuit to provide a middleware platform of software

services for fault-tolerant computing in harsh environments where execution errors are expected

to be common in occurrence.

The means for accomplishing software resilience is process redundancy: a system of multiple processes operating in pursuit of a common application improves resilience by mitigating individual instances of execution failure. In order to recover from

potential failures in processes over the application execution, the processes are developed with

ADDAM through task division. Task division in the system is modeled after a traditional

message-passing system and these tasks can be distinct for distributed processing, or replicated

for increased redundancy.

Each process has a unique identifier, referenced globally in the network of processes for

peer communication. The identifier is also used for correlating a process with its role of either

the coordinator instance or worker instance; the same process can assume either role as needed. Worker failover is handled by task re-issue from the coordinator; coordinator failover is

being developed through distributed election, and both types of failover are assisted with process

restart through a cyclical processor monitor to prevent ADDAM process extinction through

successive execution faults.

The latest prototype of ADDAM provides fault awareness to an app developer via an

internal publish/subscribe messaging system for propagating events. The messaging system

operates on events generated by discrete modules based on specific functionality. Currently,

ADDAM generates events for process discovery, with peer connections and disconnections tracked through heartbeats by the health reactor; the health reactor in turn generates events used both by the task manager, as it dispatches workload divisions specified by the developer, and by the coordination manager for determining process roles. Advanced fault-mitigation strategies and execution

patterns can be developed to adapt behavior depending on mission parameters. Through this

system, an extensible platform for generating fault awareness is available as another tool for

incorporating fault-tolerant computing techniques onto a variety of space computers.

Device Virtualization and Dynamic Synthesis

The last secondary goal of the ISEM-CSP mission is to demonstrate an improved

productivity tool set by generating FPGA designs through device virtualization and dynamic

synthesis. This research will allow future adopters of CSPv1 to more easily adapt FPGA designs to make use of the full SoC system. The performance and power advantages of hybrid FPGA computing systems are well established, but they have attendant challenges that have

limited adoption of the technology. From the perspective of application designers, writing

FPGA-accelerated code is a time-consuming process, complicated by low-level and relatively

unfamiliar hardware-description languages (e.g. VHDL) typically used in design, and lengthy

hardware-compilation times of tens of minutes to hours required even to make minor design


changes [80]. The effectiveness of FPGA-accelerated cores is also limited by the efficiency of

data transfer between the design cores and host software, which requires careful consideration of

data-access patterns and in-kernel work to optimize memory bandwidth.

From the perspective of system designers, FPGA acceleration poses additional

challenges: how can multiple applications be supported efficiently using common and limited

hardware resources (e.g. ultimately FPGA area); how can these systems be made resilient against

changing applications and workloads; and how can system security be ensured when applications

are encouraged to modify hardware, especially hardware with access to system memory and

other privileged resources? These challenges are even more significant for space systems, where

high launch costs can be better amortized by more flexible systems. Similarly, the cost of system

failure due to errant hardware is significantly higher, with limited options for remediation.

Academic work on device virtualization and dynamic synthesis from high-level

languages such as OpenCL [80] has shown significant promise to help address these challenges

[81]. Device virtualization raises the fine-grained FPGA device (e.g. lookup table and register

logic resources) up to the higher level of an application or domain by compiling to flexible high-

level overlays rather than directly to the device.

Figure 9-3. CLIF OpenCL Framework.


CSPv1 integrates an implementation of OpenCL that uses this approach, called CLIF [80,

81], as illustrated in Figure 9-3. Applications using this framework are written against a C task

and data API, with computational kernels specified in the OpenCL kernel language. Unlike other

OpenCL implementations for FPGAs, applications package their kernels’ source and rely on

CLIF’s runtime compiler to handle device mapping. This mapping is performed using overlays

from the system’s overlay library, which can improve system flexibility in multiple scenarios:

• Hardware/software partitioning is deferred until runtime, where it may be informed by

dynamic properties of the system (e.g., power, damaged regions, or the needs of other

workloads).

• New applications or changes are added by small patches to application software, and

hardware accelerated using support already in the overlay library or added through newly

uploaded overlays.

• The system is free to introduce error mitigation or detection, or even optimizations,

without requiring changes to application software (e.g., binding to fault-tolerant overlay

instances).
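As a brief illustration of the kernel language mentioned above (the kernel name and operation are hypothetical; this is plain OpenCL C, not a construct from the CLIF overlay library), a kernel that the runtime compiler could map onto an overlay might be:

```c
/* Illustrative OpenCL C kernel; the name and operation are
 * hypothetical, not taken from the CLIF overlay library. */
__kernel void brightness_scale(__global const uchar *in,
                               __global uchar *out,
                               const float gain)
{
    size_t i = get_global_id(0);
    float v = (float)in[i] * gain;
    out[i] = (uchar)(v > 255.0f ? 255.0f : v);
}
```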

This approach has other benefits for system design and security. High-level kernel

descriptions permit the compiler to perform optimizations that can be infeasible for human

designers. For example, previous work has shown that aggressive inter-kernel resource sharing

using overlays can result in up to 70% lower area [80], with up to 250x faster kernel switches

[81]. Since applications are implemented using the system’s overlays rather than directly using

FPGA resources, security policies can be enforced by restricting the capabilities provided by this

overlay library. For example, in our implementation, accelerators have high-performance access

to system memory through the Zynq coherency port. However, the addresses kernels can access

over this interface are restricted by each overlay’s memory controller to protect against faulting

or malicious applications.


Preliminary On-Orbit Results

The mission provides the CSP flight unit with flight heritage and proves it can survive

both launch and day-to-day space conditions. This section describes the current state of the mission

and shows preliminary upset results for the ISS LEO orbit.

Figure 9-4. Example image products from STP-H5/CSP.

At the time of this publication, STP-H5/CSP has only been in operation for a short time.

So far, STP-H5/CSP has completed full-functional testing onboard the ISS. As expected, the

flight unit downlinks its health and status telemetry to the ground while the ISS maintains signal

with the operations center. The experiment captures images and downloads thumbnails of sensor

products every 10 minutes, with examples illustrated in Figure 9-4. Finally, the CSP flight unit

successfully accepts commands from the ground, and several operations have been conducted

using commands to change configuration settings onboard the flight unit.

Table 9-1 compares the worst-case predicted upset rate for a single CSP flight board, with no fault-tolerant capabilities enabled, against the observed on-orbit rates. It is expected that the predicted rates should be much


higher than the actual flight results because the model takes an extremely conservative approach

in every calculation. While data is downlinked from the unit almost continuously and archived

with the operations center, only the data that is stored locally on the ground station was available

for analysis. In 3547.2 hours (147.8 days) of recorded observation, CSP0 has sustained 15 SEFIs

on the ARM side and 8 SEUs on the FPGA side. Similarly, CSP1 has sustained 10 SEFIs on the

ARM side, and 10 on the FPGA side. Notably, on CSPv1’s FPGA side, two of the recorded

upsets were multi-bit upsets.

Table 9-1. CSP board upset rate.

Computer Upsets/Day

CSP Model Prediction 1.9797

CSP0 FPGA 0.0541

CSP1 FPGA 0.0676

CSP0 ARM 0.1014

CSP1 ARM 0.0676

CSP Flight Unit Total 0.2909
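As a quick arithmetic check on Table 9-1 (a restatement of the counts above, not new data), the flight-unit total follows from the recorded events over the observation window:

```latex
\frac{(15 + 8 + 10 + 10)\ \text{upsets}}{3547.2\ \text{h} \div 24\ \text{h/day}}
= \frac{43}{147.8\ \text{days}} \approx 0.2909\ \text{upsets/day},
```

roughly a factor of 6.8 below the conservative model prediction of 1.9797 upsets/day.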


CHAPTER 10 FAULT-TOLERANT FRAMEWORK FOR HYBRID DEVICES

The main design challenge in developing space computers featuring hybrid system-on-

chip (SoC) devices is determining the optimal combination of size, weight, power, cost,

performance, and reliability for the target mission, while addressing the complexity associated

with combining fixed and reconfigurable logic. This is significant, because with the successful

development and flight of the CSPv1 flight computer, SoC devices in space are a current reality and likely to be the baselined option for next-generation missions. There are many

schemes for fault and error mitigation for both fixed-logic processors and reconfigurable-logic

FPGAs. Our research, however, focuses on developing a fault-tolerant computing strategy that

accounts for the hybrid nature of an SoC device and suggests a strategy that works cooperatively

between both types of architectures. We call this framework HARFT, for hybrid, adaptive, and

reconfigurable fault tolerance.

HARFT Use-Case and Design Overview

HARFT is designed to increase the reliability of space systems targeting SoC devices

composed of multicore CPUs and an FPGA fabric. Our HARFT strategy incorporates fault-

tolerant schemes within both architectures to create an overall robust, hybrid, fault-tolerant

scheme for a hybrid device. This section describes the ideal use cases for our system and describes

the major components of the framework.

Flight Example

In a science mission, a spacecraft may experience varying levels of radiation from several

sources including the South Atlantic Anomaly (Figure 10-1) and unexpected solar weather

conditions. The system operates by default in the SMP mode. The configuration manager


changes the mode dynamically, either by reacting to the current upset rate detected by the scrubber or by following previously set configurations defined by the ground station.

Figure 10-1. World Map1 displaying proton flux at South Atlantic Anomaly.

HARFT Hardware Architecture

HARFT is subdivided into three main subsystems: the hard-processing system (HPS);

the soft-processing system (SPS); and the configuration manager (ConfigMan). The HPS

consists of the ARM dual-core Cortex-A9 processor and its internal resources. The SPS consists

of programmable-logic elements of the Artix-7 FPGA fabric. Figure 10-2 illustrates a high-level

block diagram of the architecture design.

Hard-Processing System (HPS)

The HPS encapsulates the ARM cores and all the processor resources. The Zynq

architecture does not support lockstep operation in Cortex-A9 cores; therefore, fault-tolerant

strategies on the HPS involve alternating between the SMP and AMP modes. Unfortunately,

there are some limitations to AMP on Xilinx devices. Xilinx documentation notes that since

there are both private and shared resources for each CPU, careful consideration is necessary to

prevent resource contention. Linux manages and controls most shared resources, so it is

1 https://www.spenvis.oma.be/help/background/traprad/traprad.html


infeasible to run Linux on both cores of the device simultaneously. Because Linux on CPU0 controls the shared resources, CPU1 must run a lighter-weight environment, such as FreeRTOS or custom bare-metal software. Consequently, software developers may have to re-

write applications specifically for CPU1. Xilinx provides AMP-related projects and examples in

their application notes [82-84].

Figure 10-2. HARFT architecture diagram

Soft-Processing System (SPS)

The SPS constitutes a scalable number of PRRs and a static-logic component. Each PRR

can be configured as either a Xilinx MicroBlaze processor or an auxiliary hardware accelerator.

MicroBlazes instantiated in the PRRs operate in lockstep and aggregate as one redundant

processor. The static logic in the SPS contains a hybrid comparator/voter with AXI4 bus

arbitration, reset control, and PR glue logic.

Configuration Manager (ConfigMan)

An essential component of HARFT is the ConfigMan. This component is an independent,

triplicated MicroBlaze system executing operations in lockstep, residing in the static logic of the

programmable-fabric design. The ConfigMan is multipurpose, and can perform operations such


as FPGA configuration-memory scrubbing, act as a fault monitor by recording upset events, and

adapt the system by triggering fault-tolerant mode changes. The ConfigMan accesses the FPGA

configuration memory using the AXI Hardware Internal Configuration Access Port

(AXI_HWICAP) IP core (ICAPE2 primitive) and obtains the configuration memory frame ECC

syndrome using a custom AXI-based IP core (FRAME_ECCE2 primitive)2.

ConfigMan Scrubbing

To perform scrubbing, the ConfigMan instructs the ICAPE2 to read back one FPGA

frame. During this readback, the FPGA frame passes automatically through the FRAME_ECCE2

block to compute the ECC syndrome. The ConfigMan reads the FPGA frame from the

AXI_HWICAP buffer into local memory and reads the ECC syndrome from the

FRAME_ECCE2 block. If the syndrome is zero, then there was no error detected and the

ConfigMan proceeds to inspect the next FPGA frame. If the syndrome is nonzero then an error is

present and the syndrome is decoded to determine the word and bit location of the fault (note: some errors are detectable but uncorrectable; these are resolved with a full system reset). An

FPGA frame is corrected by flipping the faulty bit in the frame stored in local memory, as

located by the ECC syndrome. The ConfigMan instructs the ICAPE2 for FPGA frame write-back

to correct the frame in configuration memory. There are 7692 frames in the Zynq-7020 device,

with 101 words per frame, and 32 bits per word. More information detailing these interactions

can be found by Stoddard [85].
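The scrub loop described above can be summarized in a short sketch (the icap_/ecc_ helpers below are hypothetical stand-ins for the AXI_HWICAP and FRAME_ECCE2 accesses, not an actual driver API):

```c
/* Sketch of the ConfigMan scrub pass; all extern helpers are
 * hypothetical stand-ins, not a real driver API. */
#include <stdint.h>

#define NUM_FRAMES      7692U   /* configuration frames in the Zynq-7020 */
#define WORDS_PER_FRAME 101U    /* 32-bit words per frame */

extern void     icap_readback_frame(uint32_t frame, uint32_t *buf);
extern void     icap_writeback_frame(uint32_t frame, const uint32_t *buf);
extern uint32_t ecc_read_syndrome(void);
extern int      ecc_decode(uint32_t syn, uint32_t *word, uint32_t *bit);
extern void     system_full_reset(void);

void configman_scrub_pass(void)
{
    static uint32_t frame_buf[WORDS_PER_FRAME];

    for (uint32_t frame = 0; frame < NUM_FRAMES; frame++) {
        /* Readback pushes the frame through FRAME_ECCE2 automatically. */
        icap_readback_frame(frame, frame_buf);

        uint32_t syn = ecc_read_syndrome();
        if (syn == 0)
            continue;                        /* no error in this frame */

        uint32_t word, bit;
        if (ecc_decode(syn, &word, &bit)) {  /* correctable single bit  */
            frame_buf[word] ^= (1U << bit);  /* flip the faulty bit     */
            icap_writeback_frame(frame, frame_buf);
        } else {
            system_full_reset();  /* detectable but uncorrectable error */
        }
    }
}
```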

2 http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_7/7series_hdl.pdf


ConfigMan Mode-Switching Mechanics

When the fault-tolerant mode changes, the ConfigMan transfers partial bitstream(s) from

DDR memory to the AXI_HWICAP for PR. A mode switch that increases the number of

processers (e.g., simplex to duplex) requires a reset of the SPS to resynchronize the MicroBlazes

for lockstep operation. However, when the mode switch decreases the number of processors

(e.g., TMR to simplex), no reset is required since the leftover MicroBlazes remain synchronized.

ConfigMan handles PR efficiently when switching modes; only the necessary regions are

reconfigured.

ConfigMan mode switching process

ConfigMan triggers mode switching in two ways. The first is adaptive mode switching based on incoming upsets and the faults recorded by the ConfigMan. Since the ConfigMan is programmable, the user can program various algorithms, such as the windowing strategy in Jacobs et al. [36]. The second mode switch occurs when the ConfigMan receives a command from the ground station to place the system into a particular mode for a specific period of time. An example is an incoming solar flare: controllers on the ground can command the ConfigMan to change the fault-tolerant strategy in advance of the event.
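As one plausible sketch of the adaptive trigger, the C fragment below applies simple per-window thresholds to the upset counts gathered while scrubbing; the thresholds, mode set, and reconfigure_mode() helper are illustrative assumptions, and a windowing policy such as that of Jacobs et al. [36] could be substituted for the decision logic.

    /* Fault-tolerant modes visible to the ConfigMan (FEFT sub-modes
     * omitted for brevity). */
    typedef enum { MODE_SMP, MODE_AMP, MODE_FEFT_TMR } ft_mode_t;

    /* Illustrative thresholds: upsets counted over one scrubbing window. */
    #define AMP_THRESHOLD 2u   /* assumed value */
    #define TMR_THRESHOLD 5u   /* assumed value */

    extern void reconfigure_mode(ft_mode_t mode);  /* hypothetical PR trigger */

    /* Called once per scrubbing window with the upsets recorded in it. */
    void adapt_mode(unsigned upsets_in_window, ft_mode_t current)
    {
        ft_mode_t next;

        if (upsets_in_window >= TMR_THRESHOLD)
            next = MODE_FEFT_TMR;   /* harsh environment: maximize reliability */
        else if (upsets_in_window >= AMP_THRESHOLD)
            next = MODE_AMP;        /* moderate rates: decouple the two cores  */
        else
            next = MODE_SMP;        /* quiet environment: maximize performance */

        if (next != current)
            reconfigure_mode(next); /* partial reconfiguration and/or reboot   */
    }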

SPS Static Logic

The second essential component of HARFT is the SPS-Static Logic (SL). The SPS-SL is,

in essence, a custom IP core that is a hybrid comparator/voter combined with an AXI

Multiplexer. Each of the MicroBlazes from the PRRs includes lockstep signals, which partially

contain the processor state of the MicroBlaze (IP_AXI Instruction Bus and DP_AXI Data Bus).

These signals are inputs to the SPS and multiplexed to the output depending on the current fault-

tolerant mode configuration. Figure 10-3 illustrates the ConfigMan and SPS-SL interactions.


Figure 10-3. ConfigMan and SPS-SL architecture diagram.
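As a behavioral sketch of the comparator/voter function described above (which HARFT implements in the FPGA fabric, not in software), the C fragment below shows bitwise 2-of-3 majority voting over replicated bus words, alongside the duplex comparison that can detect, but not correct, a mismatch; it illustrates the logic only and is not the SPS-SL implementation.

    #include <stdint.h>

    /* Bitwise 2-of-3 majority across three replicated bus words (TMR). */
    static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (b & c) | (a & c);
    }

    /* Duplex comparison: a mismatch can be detected but not corrected. */
    static inline int duplex_mismatch(uint32_t a, uint32_t b)
    {
        return a != b;
    }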

Fault-Tolerant mode switching

The ConfigMan dynamically switches between three main fault-tolerant modes during

flight operations. These modes refer to a specific configuration of the HPS and the SPS on the

device. Figure 10-4 shows a graphical diagram highlighting the modes.

Figure 10-4. Illustrated fault-tolerant modes diagram.

1. SMP + Accelerators—In this mode, Linux runs on both Cortex-A9 cores in SMP mode.

The PRRs are allocated for hardware acceleration. This mode is the highest-performance

mode; the HPS provides high-performance software execution, accelerating applications


by using parallel computing tools, such as OpenMP, and leveraging hardware

accelerators instantiated in the FPGA.

2. AMP + Accelerators—In this mode, Linux runs on only one Cortex-A9 core (CPU0). Depending on the mission constraints, a real-time operating system (RTOS), such as FreeRTOS, can run on CPU1 for real-time operations. Alternatively, CPU1 can run the bare-metal equivalent of the Linux CPU0 application in a duplex-like mode, using shared memory to pass status and health updates (a minimal sketch of this exchange follows the list). In this scenario, the PRRs can also be allocated to hardware acceleration.

3. FPGA-Enhanced Fault Tolerance (FEFT)—The final reliability mode refers to a number

of sub-configurations available in the FPGA fabric. The configurations describe

combinations of either MicroBlaze processors or hardware accelerators in the FPGA

fabric (e.g., two MicroBlaze processors in two PRRs, with remaining PRRs as hardware

accelerators). These configurations feature at least one MicroBlaze in a PRR, with the

rest of the PRRs filled with hardware accelerators. If there is more than one MicroBlaze,

they will operate in lockstep. Once this mode engages, the MicroBlaze(s) will take

control of key flight-system applications. This mode is the most reliable; however, the

MicroBlazes operate at a much slower clock frequency than the ARM cores on the HPS

system, and therefore have much lower performance.
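To illustrate the duplex-like exchange in mode 2, the C sketch below shows one simple way the Linux application on CPU0 and the bare-metal application on CPU1 might share status and health updates through an agreed shared-memory region; the base address, structure layout, and function names are assumptions for illustration only (on the Linux side the region would typically be mapped via /dev/mem or a driver rather than cast directly).

    #include <stdint.h>

    /* Hypothetical shared-memory region agreed on by both cores; it would
     * be reserved out of the Linux memory map (e.g., via the device tree). */
    #define SHARED_MEM_BASE 0xFFFF0000u   /* assumed address */

    typedef struct {
        volatile uint32_t heartbeat;  /* incremented by CPU1 on every cycle */
        volatile uint32_t status;     /* application-defined health bits    */
    } amp_health_t;

    static amp_health_t *const health = (amp_health_t *)SHARED_MEM_BASE;

    /* CPU1 (bare-metal) side: publish liveness and status each loop pass. */
    void publish_health(uint32_t status_bits)
    {
        health->status = status_bits;
        health->heartbeat++;          /* monotonic counter acts as heartbeat */
    }

    /* CPU0 (Linux) side: poll the counter; a stalled value indicates the
     * peer core has hung or crashed. */
    int peer_is_alive(uint32_t last_seen, uint32_t *now)
    {
        *now = health->heartbeat;
        return *now != last_seen;
    }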

Mode switching

The ConfigMan is responsible for switching modes in the FPGA while in the FEFT

mode. To switch between SMP and AMP, a simple script renames the boot files, since each

configuration has different settings for U-Boot and a corresponding first-stage boot loader.

Challenges

When designing a system using HARFT, the developer should consider several issues for

a specific mission. We recommend HARFT for those familiar with Xilinx software development,

FPGA development, and Linux development. Configuration for AMP requires designers to

change configuration settings in U-Boot and make modifications to the stand-alone board

support package (BSP) for the first-stage boot loader and additional applications. For this design,

Xilinx provided the custom BSP supporting AMP on the Zynq. Additionally, we do not

recommend switching tool versions mid-development, since the build process varies drastically across Xilinx versions. At present, HARFT uses Vivado 2015.4 and SDK, and we encountered several issues with Vivado, including signals that were randomly disconnected and parameters and configurations that changed unexpectedly.

Flight configuration and use model

We designed HARFT to perform optimally in low-Earth orbit (LEO) and in environments with a typical profile of generally low upset rates punctuated by short bursts of relatively high upset rates. The limits of HARFT are closely tied to the radiation-effect limits of the Zynq, and HARFT was specifically structured for the CSPv1 flight-unit

configuration. Developers may wish to fly the 7-series Zynq with caches disabled due to the

behavior described by Wilson et al. [86]. We also recommend ECC on the DDR memory due to

the need to store bitstreams between configurations. Finally, radiation-hardened or -tolerant

(with multiple images) non-volatile storage is recommended, so that boot images for SMP and

AMP modes remain uncorrupted.

Experiments and Results

This section discusses experiments and HARFT prototype development to evaluate our

ideas and architectural design. First, we discuss general experiments, which verify the limitations

of the processor modes and expected behavior on a testbed. These experiments show the strong

need for adaptive flexibility in a changing radiation environment. Next, this section provides a

brief overview of the radiation-effects methodology introduced in Chapter 5 that determines the

estimated effectiveness of our proposed method. Finally, we describe the developed prototype

for HARFT, present metrics and benchmarks, show the FPGA resource utilization

and scrubber performance, and discuss expected HARFT behavior due to radiation effects.

Processor Experiments

We examine several processor tests as part of the problem-determination phase of this

research and for familiarization with AMP configuration on the Zynq. These tests consist of


configuring the operating system for each test, and then halting one of the cores or corrupting the

program counter (PC) in order to crash the program using the built-in debugging tools.

Basic SMP experiment

This simple experiment confirms that unexpected errors (which could be the result of an SEE) in one of the cores in SMP mode can lead to a system crash. This outcome is significant

because if SMP does not crash from an upset in one of the cores then AMP would not be

necessary. Xilinux (Xilinx Linux) ran across both CPU0 and CPU1 in SMP mode. We conducted

10 runs for each test (halting and crashing) on both processing cores. When one of the cores halts, the behavior is not deterministic: in several tests the system continued to operate, while in others it crashed. When the PC of one of the cores changes to an unexpected address, the system always crashes.

Basic AMP experiment

This experiment shows the resilience of an AMP-configured design, and establishes that

it performs as expected on a hardware testbed. In this experiment, CPU0 runs Xilinux, CPU1

runs a bare-metal application, and a MicroBlaze runs another bare-metal application. Once again,

we conducted 10 runs for both types of tests on each of the processors. When either of the

processor cores halts, the other core continues to function nominally, and the MicroBlaze

remains unaffected. Similarly, when one of the cores has its PC set to an unexpected address, the

other core, as well as the MicroBlaze, continues operation as intended.

Reliability Modeling

To analyze HARFT, we create a dynamic fault-tree model as described in Chapter 5 as

part of a CubeSat reliability methodology. This methodology relies on tools including CRÈME

and PTC Windchill Predictions to build a model of the processing system and programmable

logic.


CRÈME96

CRÈME96 is a state-of-the-art tool for SEE-rate prediction. The tool allows the user to

generate upset rates for individual components in varying Earth orbits. CRÈME also allows a

user to simulate different orbital conditions as they relate to solar weather and galactic cosmic

rays.

Modeling methodology

The research in Chapter 5 provides a methodology for estimating the reliability of

SmallSat computers in radiation environments. Our analysis uses the microprocessor submodule

model to show upset rates of the programmable-logic and processing-system portions of the

Zynq. In this submodule, each mode has a constructed dynamic fault tree (DFT) that models the

Zynq architecture. For our analysis, we use proprietary Weibull curves (inputs into CRÈME)

gathered for the main Zynq components in the processing system and programmable logic from

radiation test reports. CRÈME then generates the upset rates based on the specified orbit. The

DFT-submodule “basic events” have the previously calculated CRÈME upset rates as inputs.
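For intuition behind the mode comparisons that follow, the classical static-redundancy expressions below relate a per-module upset rate λ (as produced by CRÈME) to voted-system reliability; this is a simplified sketch that ignores voter failures and scrubber repair, both of which the dynamic fault tree can capture:

    \[
      R(t) = e^{-\lambda t}, \qquad
      R_{\mathrm{TMR}}(t) = 3R(t)^{2} - 2R(t)^{3} = 3e^{-2\lambda t} - 2e^{-3\lambda t}
    \]

Because a duplex pair can detect a disagreement but cannot mask it, a single upset defeats duplex and simplex alike under the failure assumption used in the analysis below, which is consistent with the similar FEFT-Duplex and FEFT-Simplex curves reported there.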

HARFT Prototype Description

As a proof-of-concept for HARFT, we create a prototype design using a Digilent

ZedBoard containing the Zynq-7020 SoC. While our HARFT description encompasses a number

of possible configuration options, this section describes a single configuration that we built as a

prototype.

HPS configuration

In the prototype, the HPS of the ZedBoard runs a branch of Xilinx Linux. U-Boot and the device tree are modified to add the necessary design-specific drivers,

force single-processor operation (for AMP), and restrict DDR memory access available for the


system (DDR memory must be reserved for the MicroBlaze and to store configurations). CPU1

runs a simple bare-metal application or FreeRTOS.

SPS configuration

HARFT supports any number of desired PRRs within the resource constraints; for this

prototype, we selected three PRRs. With three PRRs, possible modes for FEFT include Simplex,

Duplex, and Triplex. The MicroBlazes are instantiated within the design and configured for

maximum performance without caches or TLBs.

ConfigMan configuration

The ConfigMan maintains a set of user-configurable thresholds for switching modes. If the ConfigMan detects a number of faults exceeding a threshold while scrubbing, it triggers a new configuration.

Additional hardware configuration

The prototype contains cores that would not be needed in a flight configuration, including UARTs, PMOD UARTs, an LED core, and switches. We include these cores explicitly for project debugging and testing.

Table 10-1 and Table 10-2 list the resource utilization for the three PRRs and the

complete prototype on the device. Figure 10-5 shows the entire placed and routed design. The

cyan highlight denotes the ConfigMan, the light purple denotes the SPS-SL, and the light blue

denotes the hardware test cores (UARTs, LEDs, bus logic, etc.). Lastly, the yellow,

blue, and red regions represent the three PRRs.


Table 10-1. PRR resource utilization.
Resource          PRR0   PRR1   PRR2   Total
Slice-LUTs        2428   2433   2440   13931
Slice-Registers   1884   1884   1884   11303
BRAM Tile         0      0      0      1
RAMB36            0      0      0      0
RAMB18            0      0      0      2
DSP48E1           6      6      6      18

Table 10-2. Prototype total resource utilization.
Resource          Used    Available   Utilization (%)
Slice-LUTs        13931   53200       26.19
Slice-Registers   11303   106400      10.62
BRAM Tile         1       140         0.71
RAMB36            0       140         0.00
RAMB18            2       280         0.71
DSP48E1           18      220         8.18

Figure 10-5. FPGA configuration area in floorplan view.


HARFT Prototype Analysis

For this analysis, we calculate upset rates for LEO. These rates show the reliability of each mode relative to the others. The reliability of these modes in different orbits can be extrapolated from the relationships between the modes established by the LEO results.

Figure 10-6 and Figure 10-7 show the reliability of the main modes of HARFT.

Additionally, a reliability curve representing the FPGA, if every bit on the device is considered

essential, is provided as a reference and is labeled “FPGA” in the graph. For the FEFT-mode

calculations, we assume that any upset temporarily interrupting the processor is a failure. Using

this model, FEFT-Duplex and FEFT-Simplex show similar rates because a single upset would cause either to fail; however, in practice FEFT-Duplex would detect the error, while FEFT-Simplex would continue until the device failed or the scrubber detected the error.

Figure 10-6. HARFT reliability with L2 cache disabled.


Figure 10-7. HARFT reliability with L2 cache enabled.

For these calculations, we follow the Xilinx guideline that roughly 10% of configuration-memory bits are significant in a typical design. In the model, the configuration bits of each PRR and of the static area are therefore scaled by this 10% factor to estimate the sensitive device-configuration bits.

Results demonstrate that, as expected, level-two (L2) cache has a significant effect on the

overall reliability of the system. The L2 cache is responsible for a majority of upsets on the processing system; therefore, Figure 10-6 shows the reliability of all HARFT system modes with L2 cache disabled, while Figure 10-7 shows the same with L2 cache enabled. Figure 10-7

shows that the most reliable mode for the system is the FEFT-TMR mode. In the chart, this

reliability is near one. The result is due to the low number of faults expected in LEO, while the scrubber correction rate is extremely high, as seen in Table 10-3, even under the worst-case scrubbing scenario (needing to read the entire FPGA and then write back the corrected frame). The per-frame and full-device times in Table 10-3 are consistent: 7692 frames × 1.888 ms ≈ 14.5 s for a full readback.

Figure 10-6 shows AMP is more reliable than SMP, while both are slightly more reliable than

FEFT-Duplex and FEFT-Simplex.

As noted above, Figure 10-7 shows the same LEO example with L2 cache enabled. Since the L2 cache is responsible for the dominant portion of errors, the ordering of the modes by reliability changes: the AMP and SMP modes have worse reliability than all of the FEFT modes.


Table 10-3. FPGA scrubbing duration.
Operation                 Duration (sec)
Readback (Entire FPGA)    14.5246
Readback (Frame)          0.001888
Writeback (Entire FPGA)   19.9478
Writeback (Frame)         0.002593

Table 10-4. Computational density device metrics.
Processor                    Computational Density (GOPS)
                             INT8     INT16    INT32
ARM Cortex-A9 Dual-Core      32.02    16.01    8.00
ARM Cortex-A9 Single-Core    16.01    8.00     4.00
MicroBlaze                   0.125    0.125    0.125

HARFT Performance Modeling

We calculate device metrics, as described by Lovelly [61], using the theoretical

maximum performance for each of the three modes, illustrated in Table 10-4. While floating-

point calculations are available, this table only displays integer operations for brevity and to

compare with benchmark results, which are integer only. It should be noted that, while the SMP

mode may have lower reliability, it has dramatically increased performance over the FEFT

mode.

To provide an alternate view of processor performance, we benchmark the featured processors with CoreMark, a benchmark developed by the Embedded Microprocessor

Benchmark Consortium with the goal of measuring the performance of embedded-system CPUs.

Table 10-5 displays the results of the benchmarks and confirms the theoretical trends calculated

for device metrics. We note that the L2 cache does not appear to improve the performance of the

CoreMark benchmark. This result is explained by the benchmark's low memory usage relative to the 32-KB level-one (L1) instruction and data caches. The data and bss segments amount to about 16 KB, roughly half the capacity of the L1 data cache. The text segment is about 67 KB, which is larger than the L1 instruction cache and may incur a slight performance penalty from cache misses, explaining the small differences in the results.

Table 10-5. Zynq processors’ CoreMark benchmarking performance.
Configuration                          Iterations/sec
Single-Core ARM with Caches Enabled    1980.2979
Single-Core ARM w/o L2 Cache           1971.2254
Single-Core ARM w/o Caches             116.9640
FPGA soft-core MicroBlaze              9.5975

We compile the benchmark used in Table 10-5 with the “PERFORMANCE_RUN” configuration and -O2 compiler optimizations. The number of iterations varied: 1000 for the MicroBlaze and 100,000 for the ARM. This discrepancy results from the enormous time the MicroBlaze would require to execute 100,000 iterations; the timing-precision error remains minimal because each run exceeds 10 seconds of execution.

No application-specific accelerators have been developed thus far for the prototype. Since

the amount of speedup varies with application and hardware design, we assume that each PRR

accelerator adds 100 iterations/sec to highlight and establish the general-profile trends for the

reliability modes.
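The estimates plotted in the trade-space charts follow from Table 10-5 and these assumptions; the short C fragment below reproduces one plausible version of that arithmetic (SMP as double the single-core score, 100 iterations/sec per accelerator-filled PRR), where the per-mode allocation of accelerator PRRs is our illustrative assumption rather than a value taken from the figures.

    /* Estimated CoreMark iterations/sec per mode for the 3-PRR prototype. */
    #define ARM_SINGLE  1980.2979   /* Table 10-5, caches enabled       */
    #define MICROBLAZE     9.5975   /* Table 10-5, FPGA soft-core       */
    #define ACCEL_GAIN   100.0      /* assumed gain per accelerator PRR */

    double est_smp(void)          { return 2.0 * ARM_SINGLE + 3 * ACCEL_GAIN; }
    double est_amp(void)          { return ARM_SINGLE + 3 * ACCEL_GAIN; }
    double est_feft_simplex(void) { return MICROBLAZE + 2 * ACCEL_GAIN; }
    double est_feft_tmr(void)     { return MICROBLAZE; /* all PRRs vote */ }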

Figure 10-8 shows the reliability vs. performance of different modes depending on

varying fault rates with L2 cache disabled. We estimate the calculations for SMP mode by

doubling the single-core results of Table 10-5 (the CoreMark benchmark is single-threaded).

The graph highlights the Pareto-optimal line for the various configurations and indicates that it is only useful to switch among AMP, SMP, and FEFT-Triplex when L2 cache is disabled.


Figure 10-8. Upsets per day vs. performance with L2 cache disabled.

Figure 10-9. Upsets per day vs. performance with L2 cache enabled.

Figure 10-9 shows the same results with L2 cache enabled. The HPS shows drastically higher performance; however, it is much more prone to upsets. FEFT-Duplex and FEFT-Simplex

are more viable in this configuration since they provide higher reliability than the HPS modes

while still maintaining higher performance than FEFT-Triplex. This chart illustrates the flexible

trade space for switching modes on a prototypical LEO mission.

Framework Status and Future Considerations

This chapter presents a novel, hybrid, fault-tolerant framework, HARFT, designed

specifically to adapt to the dual-architecture capabilities and needs of SoC devices. We built a


specific HARFT configuration to test and verify the structural and design features as proof of

concept. HARFT features three dynamically configured modes: (1) SMP + Accelerators; (2)

AMP + Accelerators; and (3) FEFT + Sub-configurations. The benchmarking and reliability

analysis list these modes in order of the highest to lowest in performance, and from lowest to

highest in reliability. A custom-designed IP core, ConfigMan, simultaneously scrubs the FPGA

for faults, determines upset rate, and dynamically reconfigures the fault-tolerant mode. Our

experiments in this chapter verify the functionality of the prototype, especially with regard to the behavior of the processing-system modes in AMP and SMP. The analysis highlights that, since the L2 cache is prone to upsets, the HARFT mode selection changes depending on whether the mission designer enables or disables the L2 cache. Finally, with these methods on a hybrid SoC, a

spacecraft may adapt to changing environmental conditions in order to achieve a high level of

both performance and reliability for each mission scenario.

There are several features that we propose to improve the functionality and performance

of HARFT, which could be investigated in future development. Several of these additional

features are not complex; however, we did not include these features due to time restrictions.

Key additions include dynamic recovery of the system by the remaining working processors, checkpointing of system state, optimization of timing and FPGA performance, and, finally, the use of machine intelligence in the ConfigMan for mode switching.


CHAPTER 11 CSP SUCCESSORS

The CSP concept is not restricted with respect to spacecraft size (i.e., CubeSat vs. large satellite) or limited to one processing device (e.g., the Xilinx Zynq); it is a design concept that can be scaled and expanded to many scenarios. This chapter describes alternative applications of the CSP concept to platforms beyond the CSPv1.

µCSP and Smart Modules

The CSP team decided to address processing and networking needs of future smart

modules (e.g., smart sensors, smart actuators), as well as improve computing capability for

lower-end CubeSats. Following the design principles proposed for hybrid space computing in

Chapter 4 the CSP team developed a new design known as µCSP [90]. Like CSPv1, µCSP is

designed with a hybrid mix of commercial and rad-hard components supplemented with

techniques in fault-tolerant computing. µCSP also features a hybrid-processor architecture

(Microsemi’s SmartFusion 2 M2S090), with a mix of fixed and reconfigurable logic (ARM

Cortex-M3 processor combined with a Flash-based FPGA fabric), but all in a smaller form factor

with lower SWaP-C. µCSP is smaller than a credit card and designed to integrate into (but not be

limited to) 1U SmallSat form factors. This section describes the design decisions and concepts

for the development of both µCSP and Smart Modules.

Concepts of Smart Modules

Despite CubeSats having a common mechanical structure, the internal hardware design

may differ drastically between implementations. Many CubeSats are one-off designs, specific to each mission and its requirements. While these designs differ, there

are design commonalities that must be present to guarantee functionality (e.g., power,


communications). µCSP enables the concept of “smart modules” to address these design

challenges.

The Smart Module concept has three main objectives:

1. Provide “smart” capability to each design slice

2. Achieve faster configuration and prototyping

3. Exploit reuse of designs through qualification

The smart-module system is a framework for designing a series of hardware platforms

that can be easily configured, integrated, and tested in preparation for a new mission. The main

idea is to construct a series of hardware “cards” or “slices” that have the desired sensors and

functionality while following the provided design template. Once the key sensors are identified,

they are placed and routed into a hardware card. This hardware card is designed using a baseline template that features two high-density connectors in the center of the board, a backplane connector, and (optionally) two network (e.g., SpaceWire) connectors that can also be routed through the backplane. An example template is illustrated in Figure 11-1. The smart-module

framework also enables configurable distributed systems. Distributed configurations and

processing can apply within a single spacecraft, with space computers (e.g., CSPs) and smart

modules (e.g., instruments and actuators equipped with µCSPs). Wireless smart modules could

also be developed to promote networking and distributed systems across spacecraft.

Figure 11-1. Example template for Smart Module.


The two high-density connectors shown are used to attach our new low-power, hybrid

computer, µCSP, to the module. The card can also plug into a backplane board with the

backplane connector. This backplane connector provides power, ground, and bus

communications to each of the modules. A board connection and mating diagram is displayed in

Figure 11-2. Finally, the two SpaceWire connectors link each module to the board above and

below it, forming a ring network as seen in Figure 11-3.

Figure 11-2. Integration and mating with a Smart Module.

The µCSP present on each card provides a smart module with low-power processing.

µCSP can scale its power based upon the processing required for the node. One major benefit

from this design is that once a hardware card is developed, it can be placed anywhere in the

stack, due to the configurability of the connections. Once drivers and software are developed for

the card, the card is portable and can be reused to rapidly prototype or assemble entire flight

designs.

Example devices are elaborated by NASA Ames [59], with examples summarized in Table 11-1 (e.g., a Smart Thruster card).


Figure 11-3. Ring network connection for Smart Module.

As the “brain” for each smart module, µCSP allows designers to focus on their

application and not on low-level implementation. As more of these hardware cards are developed, they form an inventory of designs that can be taken straight from “shelf to spacecraft.”

Table 11-1. Example components for Smart Modules.
Subsystem                            Example Components
Power                                Solar Cells, Batteries, Power Generator
Propulsion                           Thruster, Solar Sail
Communication                        Transmitters, Flight Terminal
Instruments                          Optical Spectrometer, Photometer, Particle Detector
Attitude Determination and Control   Reaction Wheels, Magnetorquer, Control Moment Gyros, Star Tracker, Sun Sensors, GPS Receiver and Antennas


µCSP Hardware Architecture

µCSP is designed to attach to a 1U CubeSat form-factor board (Smart Module), through

two high-density connectors on the bottom. µCSP is roughly the size of a credit card (1.5" x

2.8”) and 63 mils thick. An isometric view of the prototype of this board is provided in Figure

11-4. All components for the board were purchased for an industrial temperature grade to support

a temperature ranging from –40°C to +85°C.

µCSP can operate at 50 to 100 mW in a low-power standby mode and can be awakened

with an interrupt. The nominal operational mode is estimated at 500 to 800 mW. Finally, we

estimate maximum power with full utilization of the ARM Microcontroller Subsystem (MSS)

and FPGA fabric at around 1 Watt.

Figure 11-4. µCSP computer board testing prototype.

This new, small space computer has several main communication interfaces and I/O pins

available. µCSP provides over 40 differential pairs (that can also be configured for single-ended

operation). The board features two interfaces each for UART, I2C, and SPI (4 slave-selects

each). With the PHYs placed on the Smart Module, the µCSP can support one CAN and one

USB2.0 interface. Our board has an Ethernet PHY to support 100 Mb/s connections, as well as 1

lane of PCI-Express. Finally, a JTAG interface is included to program and configure the device.


The inexpensive, commercial Emcraft SmartFusion2 System-on-Module (SoM) development platform can be fully interfaced with any design following the Smart Module

template. This approach allows Smart Module designs to be tested without a µCSP, solely using

the Emcraft SoM, providing a cost-effective means of creating a ground-system testbed and

performing verification. µCSP exhibits near complete pin compatibility with the SoM’s

evaluation board, albeit with some minor modifications. Finally, there are future plans for

“carrier cards” with commercial components, which can be placed into the radiation-hardened

footprints to assemble a commercial µCSP.

Table 11-2. Major components of µCSP.
Device                 Vendor               Commercial / Radiation-Hardened/Tolerant
Switching Regulators   3D-Plus              Radiation-Hardened
NOR Flash              Aeroflex             Radiation-Tolerant
Watchdog Timer         Intersil             Radiation-Hardened
SmartFusion2           Microsemi            Commercial
LPDDR                  Intelligent Memory   Commercial

Adhering to the CSP concept, µCSP includes both commercial and radiation-hardened

subsystems. Commercial components are featured for performance with low SWaP-C, and are

closely managed by radiation-hardened or -tolerant components. Table 11-2 shows the key

subsystem components in µCSP.

Microsemi’s SmartFusion2 is a powerful, hybrid device featuring an ARM Cortex-M3

processor combined with a flash-based FPGA fabric. µCSP employs the M2S090 model, which is

the most capable of the SmartFusion2 devices in a 484-pin package. Some key characteristics of

the selected device are listed in Table 11-3 and Table 11-4.


Table 11-3. SmartFusion2 ARM specifications.
Maximum Clock Frequency              166 MHz
Instruction Cache                    8 KB
Embedded SRAM (eSRAM)                64 KB
Embedded Nonvolatile Memory (eNVM)   512 KB

Table 11-4. SmartFusion2 FPGA specifications.
Logic Elements   86,184
Math Blocks      84
SRAM Blocks      2074

µCSP Software Architecture

The featured technology on µCSP, the SmartFusion2 SoC, includes the ARM MSS

(Cortex-M3) as its built-in hardcore processor. The Cortex-M3 was specifically developed to

provide high performance at low power for microcontroller-class applications. This flexible platform can easily support two popular operating systems. The first is uClinux, an embedded Linux/microcontroller project that ports Linux to systems that do not have a Memory

Management Unit (MMU). U-boot can be installed on the on-chip, non-volatile memory to load

uClinux and the root filesystem.¹ For applications that require determinism in execution, the Real-Time Operating System (RTOS) known as FreeRTOS can be booted on the Cortex-M3.²

¹ http://www.uclinux.org/
² http://www.freertos.org/SmartFusion2_RTOS.html
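As a minimal illustration of hosting deterministic work under FreeRTOS on the Cortex-M3, the sketch below creates one periodic task using the standard FreeRTOS API; the board_toggle_led() helper is a hypothetical board-support call, and a real µCSP application would attach its real-time duties to tasks like this one.

    #include "FreeRTOS.h"
    #include "task.h"

    /* Hypothetical board-support call; the name is illustrative only. */
    extern void board_toggle_led(void);

    /* Periodic task: deterministic 1-Hz work, e.g., sampling or telemetry. */
    static void vHealthTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            board_toggle_led();               /* visible liveness indicator */
            vTaskDelay(pdMS_TO_TICKS(1000));  /* block until the next period */
        }
    }

    int main(void)
    {
        /* Create one low-priority task, then hand control to the scheduler. */
        xTaskCreate(vHealthTask, "health", configMINIMAL_STACK_SIZE,
                    NULL, tskIDLE_PRIORITY + 1, NULL);
        vTaskStartScheduler();
        return 0;   /* not reached if the scheduler starts successfully */
    }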

Future work for µCSP involves integrating NASA Goddard's open-source flight-system software, Core Flight Executive (cFE), and key supporting libraries and applications found in their Core Flight System (cFS), to the SmartFusion2 under uClinux. Depending on availability and progress, cFS developers have a project in progress, called micro-cFE, to develop a minimal cFS flight-software framework specifically targeting small payloads and CubeSats, which could be used in µCSP's build system.³

³ http://www.coreflightsystem.org

µCSP Fault-Tolerant Architecture

µCSP includes fault-tolerance methods beyond its radiation-hardened and -tolerant

components. The FPGA fabric of the SmartFusion2 is flash-based, which significantly differs

from SRAM-based counterparts. While SRAM-based FPGAs are frequently affected by SEEs,

the reconfigurable flash cell is resilient against SEEs [87], which makes flash-based FPGAs

particularly useful for space-based applications.

µCSP includes a built-in hardware watchdog timer in the SmartFusion2, in addition to the

external, hardened watchdog device by Intersil. This external watchdog is critically important to

ameliorate radiation concerns for the operation of the SmartFusion2 in space. A whitepaper by

Microsemi [88] states:

“… tests indicate that the IGLOO2 FPGAs and SmartFusion2 FPGAs encounter non-destructive latch-ups in heavy ion radiation testing, at energy levels low enough to cause concern in low earth orbit (LEO) space applications”

This interim report was published in 2014, but the behavior was investigated further with additional testing by Dsilva [89]. In that report, Single-Event Functional Interrupt (SEFI) behavior was studied more closely, and four different mechanisms were identified to recover the MSS if a SEFI occurred: (1) the MSS recovers by itself through a time-out; (2) the MSS built-in watchdog recovers; (3) a reset is issued to recover the MSS; or (4) a full power cycle is needed to recover. A full power cycle is required to recover certain components of the MSS; consequently, the Intersil hardware watchdog on µCSP will perform this reset function when triggered by a lack of heartbeat from the SmartFusion2. Since a watchdog reset of the system may be required under certain upset conditions, µCSP is only recommended for missions and flight applications where 100% availability is not a driving requirement.
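To illustrate the heartbeat relationship, the bare-metal C sketch below toggles a heartbeat line that the external Intersil watchdog monitors; the helper names, pin assignment, and service interval are hypothetical, and a real design would derive the interval from the watchdog's timeout window.

    #include <stdint.h>

    /* Hypothetical board-support helpers; a real design would use the
     * Microsemi firmware drivers for the MSS GPIO block. */
    extern void gpio_write(unsigned pin, unsigned value);
    extern void delay_ms(unsigned ms);
    extern int  run_self_checks(void);   /* application health checks */

    #define HEARTBEAT_PIN     0u
    #define HEARTBEAT_HALF_MS 250u       /* assumed toggle interval */

    /* Toggle the heartbeat line only while self-checks pass. If the MSS
     * suffers a SEFI and this loop stalls (or the checks fail), the external
     * Intersil watchdog stops seeing edges and power-cycles the device. */
    void heartbeat_loop(void)
    {
        unsigned level = 0;
        for (;;) {
            if (run_self_checks()) {
                gpio_write(HEARTBEAT_PIN, level ^= 1u);
            }
            delay_ms(HEARTBEAT_HALF_MS);
        }
    }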

SmartFusion2 also has several built-in reliability functions described by Microsemi.⁴ Single Error Correct Double Error Detect (SECDED) protection can be turned on for several resources, including Ethernet buffers, CAN message buffers, eSRAM, USB buffers, PCIe buffers, and DDR memory controllers. There are also buffers with SEU-resistant latches, including DDR bridges, the instruction cache, MMUART FIFOs, and SPI FIFOs. SmartFusion2 also has a built-in self-test (BIST) mechanism that can be used to check the status of the device automatically upon power-up or on demand. The BIST checks the contents of nonvolatile configuration memory, security keys, settings, and ROM memory pages. Lastly, no external configuration memory is required to program and configure the device, because it retains its configuration through a power cycle. The flash fabric is resistant to power “drop outs” during configuration, which would cause reliability issues for traditional SRAM-based FPGAs.

⁴ http://www.microsemi.com/products/fpga-soc/reliability/sf2-reliability

Smart Module Designs

In addition to the design of µCSP, several smart modules are in various stages of

development and planning to showcase the versatility of µCSP, act as initial examples of types

of smart modules to be created, and demonstrate a proof-of-concept, distributed space system. A

CubeSat can be rapidly constructed once a library of validated designs has been generated for

different smart module cards. This framework will significantly improve assembly and

preparation for CubeSat missions and allow nearly identical spacecraft to be rapidly created. This

system will allow configuration of a computing swarm with functionality distributed across


multiple CubeSats. The framework will also allow fast construction of replacement spacecraft in

the event of failures.

Another benefit of adhering to the Smart Module concept is the reduction of the extensive wiring found in some spacecraft. Figure 11-5 illustrates an example of the required wiring for a 6U CubeSat. Smart modules place the processing intelligence closer to the sensors and actuators and employ a unified communications system, thereby reducing the bulk of the wiring for power and common communication interfaces.

Figure 11-5. Example of 6U CubeSat wiring harness.

The reduction of wiring has a multitude of benefits:

1. Reduces weight of spacecraft, thereby reducing cost by extension.

2. Decreases integration and test time involved with building, assembling, and testing the

wiring harness.

3. Simplifies debugging and emulation of a system; since each subsystem will be built around the same µCSP, there will be more design reuse, and engineers will no longer have to be familiar with multiple interface standards.


µCSP Achievement Highlights

There is a clear need for a high-performance computer for future CubeSat missions that are constrained by highly limited power systems. µCSP is a small, low-cost, and low-power space computer designed to provide increased capability for SmallSat missions that need higher performance and reliability despite severe resource constraints in size, weight, power, and cost. Additionally, µCSP enables the realization of Smart Modules in distributed space systems, which can provide fast configuration of spacecraft for missions, improve productivity, and reduce mission-specific redesign.

µCSP follows our original CSP Concept and features reconfigurable and multifaceted

hybrid computing, with a hybrid-system and a hybrid-processing architecture in a small form

factor. The µCSP hardware design, combined with a variety of fault-tolerant computing

techniques, running flight-system software, provides users with an optimal combination of

performance, energy efficiency, and reliability to satisfy a variety of space missions. Fast

assembly and replication of CubeSats is a key milestone in creating a distributed-computing

cluster for space, with functions distributed across different CubeSats, as well as developing

replacements for failed modules in the swarm.

SuperCSP and STP-H6/SSIVP

With the success of CSPv1, a survey was conducted to determine the next research step.

When key aerospace government and industry contacts were polled, the general response was to extend the capability of the CSPv1 design by developing a cluster of CSPv1 boards working cooperatively, focusing upon the scalability of a well-tested board instead of creating a new, unverified design. These groups described a need for a multiple-processor system featuring several boards

for both redundancy and to achieve performance targets, combined with a high-bandwidth

interconnect. Taking this need under consideration, the CSP team proposed to demonstrate a


networked cluster of space computers that can execute complex apps to bring ground-based

supercomputing capabilities to high-end space customers. The CSP team submitted several proposals to develop, demonstrate, and evaluate next-generation technologies for space

supercomputing, featuring image and video processing, with parallel, distributed, and

reconfigurable computing.

In December 2016, the CSP team was selected to fly the proposed experiment on STP-H6

as the Spacecraft Supercomputing for Image and Video Processing (STP-H6/SSIVP) mission. To

begin networking experimentation and to provide an initial development platform, the CSP team

developed the SuperCSP. The SuperCSP is an evaluation card, or backplane, with four slots that accept CSPv1 boards. The design supports JTAG, USB-UART, Ethernet,

SpaceWire, and a variety of I/O to each of the connected boards. This system is shown in Figure

11-6.

Figure 11-6. SuperCSP backplane with 4 CSPv1s.

SSIVP is a novel experiment to advance the state of the art in space computing by

demonstrating the use of high-performance computing (HPC) techniques on a space platform.

SSIVP will feature several novel flight computers working as a cluster. The mission hardware

configuration consists of a 3U flight box; 1U of the box houses dual Camera Link cameras. The


remaining 2U features four flight-qualified CSPv1 Rev. B boards as compute nodes, one CSPv1

Rev. C as head node for cluster management, a µCSP board, a power board, an interface board,

and backplane interconnect. The deconstructed assembly is featured in Figure 11-7. Inter-CSP

communication is facilitated by high-performance, point-to-point links, with the networking protocol (e.g., SerDes, SpaceWire) defined and configurable in the FPGA. The software configuration

incorporates NASA Goddard’s flight-software framework, core Flight System. However, the

cFE software bus supporting communication between all CSPv1 nodes has been redesigned with

innovative modifications. The communication backend of the original bus is replaced with OMG

Data Distribution Services (DDS) to support inter-node, publish-subscribe functionality. The

primary objective for this mission is to demonstrate and evaluate a novel framework for space-

based supercomputing with networked system-on-chip devices emphasizing high performance

and reliability. The completed flight unit prepared for environmental testing is pictured in Figure

11-8.

Figure 11-7. Deconstructed view of STP-H6/SSIVP flight box.


Figure 11-8. Fully assembled flight box for environmental testing.

CSPv2

The research for CSPv2 is the natural extension of the CSP family of research platforms. CSPv2 research studies designs for broader capabilities in the larger tier of SmallSat

missions. One key limitation of the CSPv1 is the lack of multi-gigabit transceivers (MGTs)

required for several key interfaces and communication protocols (e.g., PCI-E, Serial RapidIO, Aurora, SpaceVPX). Xilinx has tentatively announced both commercial and radiation-tolerant versions of its new Zynq UltraScale+, which would be a main consideration for CSPv2, as the hybrid nature of the device extends the CSPv1 research. The Zynq UltraScale+ contains a quad-core Cortex-A53, a dual-core lockstep Cortex-R5, and a power-management unit. These new

features can add additional mission utility and enable future studies in hybrid and fault-tolerant

computing (e.g., asymmetric multiprocessing, hypervisors), such as experiments on isolated cores

for safety-critical systems and real-time operating restrictions.


CHAPTER 12 CONCLUSIONS

This dissertation introduces the CSP concept, a new approach for hybrid, reconfigurable

space computing that has the capacity to adaptively optimize for performance, reliability, and

power to suit a variety of mission needs. The purpose of the CSP concept is not constrained to

the production and design of the first flight board (CSPv1); it strives to foster the overarching

concept of hybrid space computing. SHREC is engaging in ongoing studies to determine which

components and features could be added to develop an even more robust space-processing

platform with CSPv2. Through the CSP concept, SHREC endorses a design framework that

embraces the best features of rad-hard and commercial designs, as well as fixed and

reconfigurable devices to achieve a compelling middle-ground solution for space computing.

To validate and analyze configurations and designs derived with the CSP concept, this

dissertation presents the SmallSat computer reliability modeling methodology. The methodology

provides a straightforward series of steps to create reliability metrics for an entire system design

that allows configurations to be compared with one another. The methodology builds on state-of-the-art reliability tools and modeling software to provide estimates of system behavior under different environmental conditions. In preparation for flight, the CSPv1 design was analyzed with the methodology to assist in making critical modifications and to estimate the reliability of the design.

Realization of the CSP concept for flight has been fully achieved with the STP-H5/CSP

mission. This mission serves as a technology readiness level (TRL) advancement and space

validation for the CSPv1 board and its supporting software. In this research mission, we have

successfully demonstrated a transition from TRL1 to TRL9 on a design for SmallSat

applications. For the foreseeable mission duration, valuable radiation data and upset rates for the CSPv1 boards will be collected to gain further insight into hybrid space design, leading to future

designs and improvements. STP-H5/CSP regularly sends down health and status data along with

thumbnail images taken periodically by the image sensor. Downloaded logs and received packet

playback features of the ground software allow the CSP team to analyze the behavior of STP-

H5/CSP. So far in the mission, the observed upset rates due to radiation have been dramatically

lower than predicted rates generated from the orbital model. The STP-H5/CSP flight box is the

first venture into exploring the capabilities of the CSPv1 flight board and the CSP concept in a

real space environment.

Finally, with the success of the STP-H5/CSP mission, this dissertation highlights the

challenges of fault-tolerant strategies for hybrid designs. This dissertation shows the adaptation

and enhancement of a previous CHREC research project, known as Reconfigurable Fault Tolerance (RFT), into the Hybrid, Adaptive, Reconfigurable Fault Tolerance (HARFT)

framework, designed specifically to provide a system-level, fault-tolerant framework for SoC-

based designs. A prototype of this framework, suitable for integration with the CSPv1 design, was developed. Final plans for this work involve validating this hybrid fault-tolerant framework in space

as an upload to STP-H5/CSP or incorporated as part of STP-H6/SSIVP.

In conclusion, this dissertation presents a new concept for space-system hybrid designs, a

methodology for analyzing these hybrid designs, and finally, a novel framework to employ fault-

tolerance over the hybrid designs. The success of STP-H5/CSP is the preliminary catalyst

proving these concepts and provides examples that will help advance studies for the next

generation of space processors for even more advanced missions.


APPENDIX

SPACE PROCESSORS

Table A-1. SmallSat processors and single-board computers.
Device Vendor | Device | Type | SBC Vendor | SBC | Missions
Actel | ProASIC3 | FPGA | Xiphos | Q7, Q6 | ACES RED #1, GHGSat-D
Atmel | AT91SAM7A1 | Microcontroller | GomSpace | Nanomind A712D | STRaND-1
Atmel | AT697E | Microprocessor | SwRI | SC-SPARC8 Instrument Controller | JUNO, Solar Orbiter, MMS
Atmel | AT91SAM9G20 | Microcontroller | Tyvak | Intrepid | INCA
Atmel | AT32UC3C | Microcontroller | GomSpace | Nanomind A3200 | GOMX-3, Dellingr
Atmel | AT91SAM9G20 | Microcontroller | ISIS | OBC | QB50p1, p2, IL-02, TW-01, CN-01, BE-06, PEASSS
Atmel | ATmega329P | Microcontroller | NanoSatisfi Inc. | ArduSat Kit | ArduSat 1
Broadcom | BCM2835/6/7 | SoC | RaspberryPi.org | Raspberry Pi Modules | Pi-Sat, NODeS
Cobham | UT699 | Microprocessor | SEAKR | SBC | Orion VPU
Cobham Gaisler | LEON3FT | Microprocessor | SDL | MODAS Bus Interface Module | —
Cobham Gaisler | GR712RC | Microprocessor | SwRI | Centaur | CYGNSS, CuSP, NASA Mission Avionics, Undisclosed
Cobham | UT699 | Microprocessor | SwRI | FT Spacecraft/Instrument Controller | JUNO, FERMI, Kepler, DoD Mission
Freescale | P2020 | Microprocessor | Space Micro | Proton400k | ORS-1
Microchip | PIC24FJ256GA110 | Microcontroller | Pumpkin | PSPM D | MiRaTA, MicroMAS-1, FIREBIRD-I
Microchip | PIC24F256GB210 | Microcontroller | Pumpkin | PSPM E | —
Microchip | PIC24FJ256GA110 | Microcontroller | Pumpkin | PPM D1 | Caerus/Mayflower, DICE-1, DICE-2, Aeneas
Microchip | dsPIC33FJ256GP710 | Microcontroller | Pumpkin | PPM D2 | CINEMA
Microchip | ATmegaS128 | Microcontroller | Undisclosed | Undisclosed | Undisclosed
Microsemi | SmartFusion 2 | SoC | Clyde Space | OBC | —
Microsemi | SmartFusion 2 | SoC | NSF SHREC ctr. | µCSP | STP-H6/SSIVP
Nvidia | Tegra | SoC | Innoflight | TFLOP | Undisclosed
NXP | MPC8548E | Microprocessor | Aitech | SP0 | MUSES
NXP | MPC7457 | Microprocessor | SEAKR | G4 Artemis | TacSat 3
NXP | MPC8548E | Microprocessor | SwRI | High-Performance SBC | Undisclosed
Silicon Labs | EFM32GG280F1024 | Microcontroller | CubeSpace | CubeComputer | —
Silicon Labs | C8051F120 | Microcontroller | Pumpkin | PSPM B | —
Silicon Labs | C8051F120 | Microcontroller | Pumpkin | PPM B1 | QbX1, QbX2
Texas Instruments | Sitara AM3359AZCZ100 | Microprocessor | BeagleBoard.org | BeagleBone Black (Rev C) | RADSat, ANDESITE, TRYAD
Texas Instruments | Sitara AM3703 | Microprocessor | Gumstix | Overo EarthSTORM | IPEX
Texas Instruments | OMAP3530 | Microprocessor | Gumstix | Overo Water | DM7
Texas Instruments | MSP430F1612 | Microcontroller | Pumpkin | PPM A1 | CSSWE
Texas Instruments | MSP430F1611 | Microcontroller | Pumpkin | PPM A2 | —
Texas Instruments | MSP4302618 | Microcontroller | Pumpkin | PPM A3 | —
Texas Instruments | MSP430F149/169/1611/1612 | Microcontroller | Pumpkin | FM430 | Delfi-C3, HawkSat-1, ITU-pSAT1, AIS Pathfinder 2, GOLIAT, e-st@r, Libertad-1
Texas Instruments | TI320C6713DSP | DSP | SDL | MODAS CPU Module | —
Texas Instruments | TI 320C6XXXDSP | DSP | Space Micro | Proton200k | MDA MISTI, Goodrich, QuickReach
Xilinx | Zynq 7030 | SoC | GomSpace | NanoMind Z7000 | GOMX-3
Xilinx | Zynq 7020 | SoC | Innoflight | CFC-300 | Undisclosed
Xilinx | Zynq 7045 | SoC | Innoflight | INNOF6TP | Undisclosed
Xilinx | UltraScale+ | SoC | Innoflight | CHAMPS Flight Computer | Undisclosed
Xilinx | Artix-7 | FPGA | MSU | N/A | RadSat
Xilinx | Zynq 7045 | FPGA | Raytheon | S3OP | —
Xilinx | Virtex-7 | FPGA | Space Micro | Proton300k | ORS-1, Undisclosed, TESS
Xilinx | Zynq 7020 | SoC | NSF SHREC ctr./Space Micro | CSPv1 | STP-H5, SkyFIRE, CeREs, Luna-H
Xilinx | XCV800 | FPGA | Surrey Space Center | SSTRL OBC | —
Xilinx | Virtex-4 | FPGA | Tohoku University | MPU | RAIKO
Xilinx | Zynq 7020 | SoC | Xiphos | Q7 | ACES RED #1, GHGSat-D
Xilinx | Spartan-6 | FPGA | Xiphos | Q6 | OSTEO-4
Xilinx | Virtex II-Pro | SoC | Xiphos | Q5 | Genesis-1, Genesis-2
Xilinx | UltraScale+ | SoC | Innoflight | TFLOP | Undisclosed
Xilinx | V5-QV | FPGA | NASA GSFC | SpaceCube 2.0 | RRM3, Restore-L, NEODaC, RAVEN, XCOM

LIST OF REFERENCES

[1] Salas, A., Attai, W., Oyadomari, K., Priscal, C., Shimmin, R., Gazulla, O. and Wolfe, J.,

“Phonesat In-Flight Experience Results,” NASA Rept. ARC-E-DAA-TN14625, May 2014.

[2] Allmen, J., and Petro, A., “Small Spacecraft Technology,” Proceedings of the AIAA/USU

Conference on Small Satellites, http://digitalcommons.usu.edu/smallsat/2014/Workshop/10/

[retrieved Mar. 2017].

[3] Brown, O., and Eremenko, P., “Fractionated Space Architectures: A Vision for Responsive

Space,” AIAA Paper 2006-1002, 2006. doi: 10.2514/6.2006-7506

[4] The Role of Small Satellites in NASA and NOAA Earth Observation Programs, National Academies Press, 2000. doi: 10.17226/9819

[5] Earth Science and Applications from Space: A Midterm Assessment of NASA's

Implementation of the Decadal Survey, National Academies Press, Aug. 2012.

doi: 10.17226/13405

[6] Earth Science and Applications from Space: National Imperatives for the Next Decade and

Beyond, National Academies Press, 2007. doi: 10.17226/11820

[7] Achieving Science with CubeSats, National Academies Press, Oct. 2016.

doi: 10.17226/23503

[8] Doncaster, B., Williams, C., and Shulman, J., “2017 Nano/Microsatellite Market Forecast,”

SpaceWorks Enterprises, Inc. [online], Atlanta, GA, 2017,

http://spaceworksforecast.com/docs/SpaceWorks_Nano_Microsatellite_Market_Forecast_2017.p

df [retrieved Aug. 2017]

[9] NASA Office of the Chief Technologist, “2015 NASA Technology Roadmaps,” NASA

Headquarters [online], Washington, D.C., 2015,

https://www.nasa.gov/offices/oct/home/roadmaps/index.html [retrieved Aug. 2017]

[10] NASA Space Technology Roadmaps and Priorities Revisited, National Academies Press,

2016. doi: 10.17226/23582

[11] Hyten, J., “Small Satellite 2015 Keynote Speech,” 29th Annual AIAA/USU Conference on

Small Satellites [Online], Logan, UT, August 8-13, 2015, Available:

http://www.afspc.af.mil/About-Us/Leadership-

Speeches/Speeches/Display/Article/731705/small-satellite-2015-keynote-speech/ [retrieved Aug.

2017]

[12] Rudolph, D., Wilson, C., Stewart, J., Gauvin, P., George, A. D., Lam, H., Crum, G.,

Wirthlin, M., Wilson, A., and Stoddard, A., “CSP: A Multifaceted Hybrid Architecture for Space

Computing,” Proceedings of the AIAA/USU Conference on Small Satellites,

http://digitalcommons.usu.edu/smallsat/2014/AdvTechI/3/ [retrieved Sep. 2014]


[13] Fleetwood, D.M., and Winokur, P.S., “Radiation Effects in the Space Telecommunications

Environment,” Proceedings of the 22nd International Conference on Microelectronics, IEEE

Publ., Piscataway, NJ, 2000. doi:10.1109/23.736521

[14] Sexton, F. W., “Destructive single-event effects in semiconductor devices and ICs,” IEEE

Transactions on Nuclear Science, Vol. 50, No. 3, June 2003, pp. 603–621.

doi: 10.1109/tns.2003.813137

[15] Maurer, R. H., Fraeman, M. E., Martin, M. N., and Roth, D. R., “Harsh environments:

Space radiation environment, effects, and mitigation,” Johns Hopkins APL Technical Digest

[online journal], Vol. 28, No. 1, 2008, pp. 1-17,

http://techdigest.jhuapl.edu/TD/td2801/Maurer.pdf [retrieved Mar. 2017].

[16] Edmonds, L.D, Barnes, C. E., and Scheick, L. Z. “An Introduction to Space Radiation

Effects on Microelectronics,” NASA Rept. JPL00-62, May 2000.

[17] Space Radiation Effects on Electronic Components in Low-Earth Orbit, NASA Rept. PD-

ED-1258, Apr. 1996.

[18] Ladbury, R. "Radiation hardening at the system level." Proceedings of the Nuclear and

Space Radiation Effects Conference, IEEE Publ., Piscataway, NJ, 2007, pp. 1-94,

https://radhome.gsfc.nasa.gov/radhome/papers/nsrec07_sc_ladbury.pdf [retrieved March 2017].

[19] Schwank, J. R., Shaneyfelt, M. R., and Dodd, P. E., “Radiation hardness assurance testing

of microelectronic devices and integrated circuits: Radiation environments, physical

mechanisms, and foundations for hardness assurance,” IEEE Transactions on Nuclear Science,

Vol. 60, No. 3, Jun. 2013, pp. 2074–2100. doi: 10.1109/tns.2013.2254722

[20] LaBel, K. A., Johnston, A. H., Barth, J. L., Reed, R. A., and Barnes, C. E., “Emerging

radiation hardness assurance (RHA) issues: a NASA approach for space flight programs,” IEEE

Transactions on Nuclear Science, Vol. 45, No. 6, 1998, pp. 2727–2736. doi:10.1109/23.736521

[21] LaBel, K. A., Marshall, P. W., Gruner, T. D., Reed, R. A., Settles, B., Wilmot, J.,

Dougherty, L. F., Russo, A., Foster, M. G., Yuknis, W., and Garrison-Darrin, A., "Radiation

Evaluation Method of Commercial Off-the-Shelf (COTS) Electronic Printed Circuit Boards

(PCBs)," Proceedings of the 5th European Conference on Radiation and its Effects on

Components and Systems, Sep. 13-17, 1999. doi: 10.1109/RADECS.1999.858637

[22] Single Event Effect Criticality Analysis, NASA Rept. 431-REF-000273, Feb. 1996.

[23] Sampson, M. and LaBel. K. A., “The NASA Electronic Parts and Packaging (NEPP)

Program – Overview for FY14,” Space Parts Working Group (SPWG), Torrance, CA, April 21-

22, 2015, https://nepp.nasa.gov/files/25938/NEPP-FY14-SPWG-2014.pdf [retrieved Oct. 2015]

[24] Sahu, K., “EEE-INST-002: Instructions for EEE Parts Selection, Screening, Qualification,

and Derating,” NASA Rept. TP-2003-212242, Apr. 2008.


[25] Stamatelatos, M., “Probabilistic Risk Assessment: What is it and why is it worth performing

it?” Office of Safety and Mission Assurance NASA HQ [online], Washington, D.C., Apr. 2000,

http://www.hq.nasa.gov/office/codeq/qnews/pra.pdf [retrieved Oct. 2015]

[26] Vesely, W., Stamatelatos, M., Dugan, J., Fragola, J., Minarick, J., and Railsback J., “Fault

Tree Handbook with Aerospace Applications,” NASA Office of Safety and Mission Assurance

NASA HQ [Online], Washington, D.C., Aug. 2002,

http://www.hq.nasa.gov/office/codeq/doctree/fthb.pdf [retrieved Oct. 2015]

[27] Boudali, H., Crouzen, P., and Stoelinga, M., “Dynamic Fault Tree analysis using

Input/Output Interactive Markov Chains,” Proceedings of the 37th Annual IEEE/IFIP

International Conference on Dependable Systems and Networks, IEEE Publ., Jun. 2007.

doi: 10.1109/DSN.2007.37

[28] Dugan, J. B., Bavuso, S. J., and Boyd, M. A., “Dynamic fault-tree models for fault-tolerant

computer systems.” IEEE Transactions on Reliability, Vol. 41, No. 3, Sep. 1992, pp 363-377.

doi: 10.1109/24.159800

[29] Manian, R., Dugan, J.B., Coppit D., and Sullivan, K. J., "Combining Various Solution

Techniques for Dynamic Fault Tree Analysis of Computer Systems,” Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium, pp. 21-28, Nov. 1998. doi: 10.1109/HASE.1998.731591

[30] Ladbury R., “Statistical Modeling for Radiation Hardness Assurance,” Hardened

Electronics and Radiation Technology Conference, Huntsville, AL, Mar 17-21, 2014,

https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20140008964.pdf [retrieved Oct. 2015]

[31] Gulati, R., and Dugan, J. B., “A modular approach for analyzing static and dynamic fault

trees", Proceedings of Reliability and Maintainability Symposium, Philidelphia, PA, Jan. 1997,

pp. 57-63. doi: 10.1109/RAMS.1997.571665

[32] Kohn, C., “Partial Reconfiguration of a Hardware Accelerator with Vivado Design Suite for

Zynq-7000 AP SoC Processor,” Xilinx Application Note, Xilinx Incorporated, XAPP1231, Mar.

20, 2015, https://www.xilinx.com/support/documentation/application_notes/xapp1231-partial-

reconfig-hw-accelerator-vivado.pdf [retrieved Oct. 2016]

[33] Vivado Design Suite User Guide Partial Reconfiguration, Xilinx User Guide, UG909, Apr.

2016, http://www.xilinx.com/support/documentation/sw_manuals/xilinx2016_1/ug909-vivado-

partial-reconfiguration.pdf [retrieved Oct. 2016]

[34] Berg, M., and LaBel, K. A., “Introduction to FPGA Devices and The Challenges for Critical

Application: A User’s Perspective,” Hardened Electronics and Radiation Technology

Conference, Chantilly, VA, April 21-24, 2015, https://nepp.nasa.gov/files/27968/NEPP-CP-

2015-Berg-Presentation-HEART-TN22657.pdf [retrieved Oct. 2016]

[35] Wirthlin, M., “High-Reliability FPGA-Based Systems: Space, High-Energy Physics, and

Beyond,” Proceedings of the Institute of Electrical and Electronics Engineers, Vol. 103, No. 3,

pp. 379–389, Mar. 2015. doi: 10.1109/JPROC.2015.2404212


[36] Jacobs, A., Cieslewski, G., George, A. D., Gordon-Ross, A., and Lam, H., "Reconfigurable fault tolerance: A comprehensive framework for reliable and adaptive FPGA-based space computing," ACM Transactions on Reconfigurable Technology and Systems, Vol. 5, No. 4, pp. 1-30, Dec. 2012. doi: 10.1145/2392616.2392619

[37] Saleh, R., Mirabbasi, S., Lemieux, G., and Grecu, C., "System-on-Chip: Reuse and Integration," Proceedings of the Institute of Electrical and Electronics Engineers, Vol. 94, No. 6, pp. 1050-1069, Jun. 2006. doi: 10.1109/JPROC.2006.873611

[38] NVIDIA Tegra K1: A New Era in Mobile Computing, NVIDIA Corporation, Jan. 2014, http://www.nvidia.com/content/PDF/tegra_white_papers/Tegra_K1_whitepaper_v1.0.pdf [retrieved Aug. 2017]

[39] Zynq-7000 All Programmable SoC Data Sheet: Overview, Xilinx Data Sheet, DS190 (v1.11), Jun. 2017, https://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf [retrieved Aug. 2017]

[40] 66AK2Hxx Multicore DSP+ARM KeyStone II System-on-Chip (SoC), Texas Instruments Incorporated, SPRS866F (Revision F), Jun. 2017, http://www.ti.com/lit/ds/symlink/66ak2h06.pdf [retrieved Aug. 2017]

[41] Jacobs, A., Conger, C., and George, A. D., "Multiparadigm Space Processing for Hyperspectral Imaging," Proceedings of the IEEE Aerospace Conference, IEEE Publ., Piscataway, NJ, March 2008. doi: 10.1109/AERO.2008.4526468

[42] Koren, I., and Krishna, C. M., Fault-Tolerant Systems, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2007.

[43] Robinson, W. H., Alles, M. L., Bapty, T. A., Bhuva, B. L., Black, J. D., Bonds, A. B., Massengill, L. W., Neema, S. K., Schrimpf, R. D., and Scott, J. M., "Soft Error Considerations for Multicore Microprocessor Design," Proceedings of IEEE International Conference on Integrated Circuit Design and Technology, IEEE Publ., Piscataway, NJ, May 2007. doi: 10.1109/ICICDT.2007.4299574

[44] Quinn, H., Fairbanks, T., Tripp, J., Duran, G., and Lopez, B., "Single-event effects in low-cost, low-power microprocessors," Proceedings of IEEE Radiation Effects Data Workshop, July 2014. doi: 10.1109/REDW.2014.7004596

[45] Andraka, R. J., Brady, P. E., and Brady, J. L., "A Low Complexity Method for Detecting Configuration Upset in SRAM based FPGAs," Proceedings of the 6th Military and Aerospace Programmable Logic Devices Conference, Washington, D.C., Sep. 9-11, 2003, http://andraka.com/files/seu_mapld_2003.pdf [retrieved Oct. 2016]

[46] McNutt, S., "AMP up Your Next SoC Project," Xcell Software Journal [online journal], Xilinx Incorporated, No. 3, pp. 28-33, First Quarter 2016, https://forums.xilinx.com/t5/Xcell-Daily-Blog/AMP-up-Your-Next-SoC-Project/ba-p/699265 [retrieved Oct. 2016]


[47] Zynq-7000 All Programmable SoC Software Developers Guide, Xilinx User Guide, UG821, Sep. 2015, http://www.xilinx.com/support/documentation/user_guides/ug821-zynq-7000-swdev.pdf [retrieved Oct. 2016]

[48] Taylor, A., "A Double-Barreled Way to Get the Most from Your Zynq SoC," Xcell Journal [online journal], No. 90, pp. 38-45, First Quarter 2015, https://forums.xilinx.com/t5/Xcell-Daily-Blog/A-Double-Barreled-Way-to-Get-the-Most-from-Your-Zynq-SoC/ba-p/584328 [retrieved Oct. 2016]

[49] Greb, K., "Design How-To: Matching processor safety strategies to your system design," EE Times [online journal], Oct. 2011, https://www.eetimes.com/document.asp?doc_id=1279168 [retrieved Oct. 2016]

[50] MicroBlaze Processor Reference Guide, Xilinx User Guide, UG984, Apr. 2014, http://www.xilinx.com/support/documentation/sw_manuals/xilinx2014_2/ug984-vivado-microblaze-ref.pdf

[51] LaMeres, B. J., Harkness, S., Handley, M., Moholt, P., Julien, C., Kaiser, T., Klumpar, D., Mashburn, K., Springer, L., and Crum, G., "RadSat – Radiation Tolerant SmallSat Computer System," Proceedings of the AIAA/USU Conference on Small Satellites, https://digitalcommons.usu.edu/smallsat/2015/all2015/69/

[52] Cobb, S., "The Department of Defense Space Test Program," Proceedings of the AIAA/USU Conference on Small Satellites, http://digitalcommons.usu.edu/smallsat/1990/LaunchII/

[53] Sims, E., "The Department of Defense Space Test Program: Come Fly with Us," Proceedings of the IEEE Aerospace Conference, IEEE Publ., Piscataway, NJ, March 2009. doi: 10.1109/AERO.2009.4839351

[54] McLeroy, J., "Highlights of DoD Research on the ISS," ISS Research and Development Conference, June 26-27, 2012, http://astronautical.org/sites/default/files/issrdc/2012/issrdc_2012-06-27-0815_mcleroy.pdf

[55] Petrick, D., "SpaceCube Technology Brief: Hybrid Data Processing System," NASA Rept. GSFC-E-DAA-TN38741, Nov. 2016.

[56] Lin, M., Flatley, T., Geist, A., and Petrick, D., "NASA GSFC Development of the SpaceCube Mini," Proceedings of the AIAA/USU Conference on Small Satellites, http://digitalcommons.usu.edu/smallsat/2011/all2011/73/

[57] 2017 High-Risk Report, Report to Congressional Committees, GAO-17-317, Washington, D.C., Feb. 2017, http://www.gao.gov/assets/690/682765.pdf [retrieved Aug. 2017]

[58] Cheeks, N., "CubeSat/SmallSat," NASA Goddard Tech Transfer News [online journal], Vol. 12, No. 3, 2014, pp. 7-15, https://partnerships.gsfc.nasa.gov/wp-content/uploads/TechTransfer-Summer2014.pdf [retrieved Aug. 2017]


[59] NASA Ames Research Center Mission Design Division, "Small Spacecraft Technology State of the Art," NASA Rept. TP-2015-216648/REV, Jul. 2017.

[60] RAD750® family of radiation-hardened products, BAE Systems Product Guide, CS-16-F80, Jul. 2008, https://www.baesystems.com/en/download-en/20161103152724/1434555689265.pdf [retrieved Aug. 2017]

[61] Lovelly, T. M., and George, A. D., "Comparative Analysis of Present and Future Space-Grade Processors with Device Metrics," AIAA Journal of Aerospace Information Systems, Vol. 14, No. 3, Mar. 2017, pp. 184-197. doi: 10.2514/1.I010472

[62] Ginosar, R., "Survey of Processors for Space," Proceedings of Data Systems in Aerospace (DASIA), May 2012, http://www.ramon-chips.com/papers/SurveySpaceProcessors-DASIA2012-paper.pdf [retrieved Aug. 2017]

[63] Rodgers, C., Barnhart, D., and Crago, S., "Maestro Flight Experiment: A 49-core radiation-hardened processor in space and the applications that run on it," Proceedings of the IEEE Aerospace Conference, IEEE Publ., Piscataway, NJ, March 2016. doi: 10.1109/AERO.2016.7500626

[64] Cockrell, J., Yost, B., and Petro, A., "PhoneSat – The Smartphone Nanosatellite," NASA Rept. FS-2013-04-11-ARC, ARC-E-DAA-TN9822, Jun. 2013, https://ntrs.nasa.gov/search.jsp?R=20140005553

[65] Swartwout, M. A., "CubeSats and Mission Success: 2017 Update," NASA Electronics Parts and Packaging Program - Electronics Technology Workshop, NASA Goddard Space Flight Center, Greenbelt, MD, June 27, 2017, https://drive.google.com/open?id=0B_YNiLtqhzSqZ3dXdmRKc1ROWUE [retrieved Aug. 2017]

[66] Johnson, M., Beauchamp, P., Schone, H., Sheldon, D., Fuhrman, L., Sullivan, E., Fairbanks, T., Moe, M., and Leitner, J., "Increasing Small Satellite Reliability: A Public-Private Initiative," Proceedings of the AIAA/USU Conference on Small Satellites, https://digitalcommons.usu.edu/smallsat/2017/all2017/30/

[67] Lee, D., Wirthlin, M., Swift, G., and Le, A., "Single-Event Characterization of the 28 nm Xilinx Kintex-7 Field-Programmable Gate Array under Heavy Ion Irradiation," Proceedings of IEEE Radiation Effects Data Workshop, July 2014. doi: 10.1109/REDW.2014.7004595

[68] Guertin, S., "Candidate Cubesat Processors," NASA Electronic Parts and Packaging Program - EEE Parts for Small Missions Workshop, NASA Goddard Space Flight Center, Greenbelt, MD, September 10-11, 2014, https://nepp.nasa.gov/workshops/eeesmallmissions/talks/11%20-%20THUR/1350%20-%20CubesatMicroprocessor_V1.pdf [retrieved Aug. 2017]

[69] Avery, K., Fenchel, J., Mee, J., Kemp, W., Netzer, R., Elkins, D., Zufelt, B., and Alexander, D., "Total Dose Test Results for CubeSat Electronics," Proceedings of IEEE Radiation Effects Data Workshop, July 2011.


[70] Clagett, C., Santos, L., Azimi, B., Cudmore, A., Marshall, J., Starin, S., Sheikh, S., Zesta, E., Paschalidis, N., Johnson, M., Kepko, L., Barry, D., Bonalsky, T., Chai, D., Colvin, M., Evans, A., Hesh, S., Jones, S., Peterson, Z., Rodriguez, J., and Rodriguez, M., "Dellingr: NASA Goddard Space Flight Center's First 6U Spacecraft," Proceedings of the AIAA/USU Conference on Small Satellites, https://digitalcommons.usu.edu/smallsat/2017/all2017/83/

[71] Wilson, C., George, A. D., and Klamm, B., "A methodology for estimating reliability of SmallSat computers in radiation environments," Proceedings of the IEEE Aerospace Conference, IEEE Publ., Piscataway, NJ, March 2016. doi: 10.1109/AERO.2016.7500605

[72] Oldham, T., Poivey, C., Buchner, S., Kim, H., Friendlich, M., and Berg, M., "HI SEE Report for the Hynix, Micron, and Samsung 4Gbit NAND Flash Memories," Radiation Test Report Summary, NASA Goddard Space Flight Center, Aug. 2007, http://radhome.gsfc.nasa.gov/radhome/papers/T052207_Hynix_Micron_Samsung.pdf

[73] Wilson, C., Sabogal, S., George, A. D., and Gordon-Ross, A., "Hybrid, adaptive, and reconfigurable fault tolerance," Proceedings of the IEEE Aerospace Conference, IEEE Publ., Piscataway, NJ, March 2017. doi: 10.1109/AERO.2017.7943867

[74] Engel, J., Wirthlin, M., Morgan, K., and Graham, P., "Predicting On-Orbit Static Single Event Upset Rates in Xilinx Virtex FPGAs," Proceedings of the Military and Aerospace Programmable Logic Devices Conference, September 2006.

[75] Barak, J., Reed, R. A., and LaBel, K. A., "On the figure of merit model for SEU rate calculations," IEEE Transactions on Nuclear Science, Vol. 46, No. 6, Dec. 1999, pp. 1504-1510. doi: 10.1109/23.819114

[76] Petersen, E. L., "The SEU figure of merit and proton upset rate calculations," IEEE Transactions on Nuclear Science, Vol. 45, No. 6, Dec. 1998, pp. 2550-2562. doi: 10.1109/23.736497

[77] Foucard, G., "Handbook of Mitigation Techniques against Radiation Effects for ASICs and FPGAs," CERN [online], Jan. 2012, http://indico.cern.ch/event/169035/contribution/4/attachments/208507/292405/Presentation_CERN.pdf

[78] Wilson, C., Stewart, J., Gauvin, P., MacKinnon, J., Coole, J., Urriste, J., George, A., Crum, G., Timmons, E., Beck, J., Flatley, T., Wirthlin, M., Wilson, A., and Stoddard, A., "CSP Hybrid Space Computing for STP-H5/ISEM on ISS," Proceedings of the AIAA/USU Conference on Small Satellites, http://digitalcommons.usu.edu/smallsat/2015/all2015/21/

[79] Buchner, S., Kanyogoro, N., McMorrow, D., Foster, C. C., O'Neill, P. M., and Nguyen, K. V., "Variable Depth Bragg Peak Method for Single Event Effects Testing," IEEE Transactions on Nuclear Science, Vol. 58, No. 6, Dec. 2011, pp. 2976-2982. doi: 10.1109/TNS.2011.2170587

[80] Coole, J., and Stitt, G., "Fast and Flexible High-Level Synthesis from OpenCL using Reconfiguration Contexts," IEEE Micro, Vol. 34, No. 1, Jan. 2014, pp. 42-53. doi: 10.1109/MM.2013.108


[81] Coole, J., and Stitt, G., "Adjustably Flexible, Low-Overhead Overlays for Runtime FPGA Compilation," Proceedings of the 22nd Annual IEEE International Symposium on Field-Programmable Custom Computing Machines, IEEE Publ., Piscataway, NJ, May 2015. doi: 10.1109/FCCM.2015.49

[82] McDougall, J., "Simple AMP Running Linux and Bare-Metal System on Both Zynq SoC Processors," Xilinx Application Note, XAPP1078, Feb. 2013, http://www.xilinx.com/support/documentation/application_notes/xapp1078-amp-linux-bare-metal.pdf [retrieved Oct. 2016]

[83] McDougall, J., "Simple AMP: Bare-Metal System Running on Both Cortex-A9 Processors," Xilinx Application Note, XAPP1079, Jan. 2014, http://www.xilinx.com/support/documentation/application_notes/xapp1079-amp-bare-metal-cortex-a9.pdf [retrieved Oct. 2016]

[84] McDougall, J., "Simple AMP: Zynq SoC Cortex-A9 Bare-Metal System with MicroBlaze Processor," Xilinx Application Note, XAPP1093, Jan. 2014, http://www.xilinx.com/support/documentation/application_notes/xapp1093-amp-bare-metal-microblaze.pdf [retrieved Oct. 2016]

[85] Stoddard, A., "Configuration Scrubbing Architectures for High-Reliability FPGA Systems," BYU Scholars Archive All Theses and Dissertations [online journal], Paper 5704, 2015, https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=6703&context=etd

[86] Wilson, A., Wilson, A., and Wirthlin, M., "Neutron Testing of the Linux Kernel Operating on the Zynq SOC," Proceedings of the International Workshop on FPGAs for Aerospace Applications, 2015.

[87] Wang, J., Rezzak, N., Dsilva, D., Huang, C., and Lee, K., "Single Event Effects Characterization in 65 nm Flash-Based FPGA-SOC," Proceedings of the Single Event Effects Symposium, San Diego, CA, May 2014.

[88] Microsemi Corporation, "IGLOO2 and SmartFusion2 65nm Commercial Flash FPGAs Interim Summary of Radiation Test Results," G4 Radiation Summary Interim Report, 51000013-2/10.14, October 2014, http://www.microsemi.com/document-portal/doc_view/134103-igloo2-and-smartfusion2-fpgas-interim-radiation-report [retrieved June 2016]

[89] Rezzak, N., Dsilva, D., Wang, J., and Jat, N., "SET and SEFI Characterization of the 65 nm SmartFusion2 Flash-Based FPGA under Heavy Ion Irradiation," Proceedings of IEEE Radiation Effects Data Workshop, July 2015. doi: 10.1109/REDW.2015.7336733

[90] Wilson, C., MacKinnon, J., Gauvin, P., Sabogal, S., George, A. D., Crum, G., and Flatley, T., "μCSP: A Diminutive, Hybrid, Space Processor for Smart Modules and CubeSats," Proceedings of the AIAA/USU Conference on Small Satellites, http://digitalcommons.usu.edu/smallsat/2016/TS10AdvTech2/1/


BIOGRAPHICAL SKETCH

Christopher Mark Wilson received the Bachelor of Science in Computer Engineering in 2011, the Master of Science in Electrical and Computer Engineering in 2012, and the Doctor of Philosophy in Electrical and Computer Engineering in 2018, all from the University of Florida. He completed internships with NASA Goddard Space Flight Center from 2012 to 2013 and became a civil servant with the center in 2014. He was a research group leader for the Advanced Space Systems Group at the National Science Foundation Center for High-Performance Reconfigurable Computing and the Center for Space, High-Performance, and Resilient Computing from 2012 to 2018, and a visiting scholar at the University of Pittsburgh from 2017 to 2018. He also served as lab manager, payload operator of the STP-H5/CSP experiment, and mission manager of the STP-H6/SSIVP experiment at the University of Pittsburgh center from 2017 to 2018.