
Risk Assessment Practices at NASA: Studies of Design and Review Methods

Lawrence P. Chao*, Irem Tumer†, Francesca Barrientos†, and Kosuke Ishii*

*Department of Mechanical Engineering, Design Division, Stanford University, Stanford, CA 94305-4022 †NASA Ames Research Center, Moffett Field, CA, 94035-1000

Abstract—This report describes a number of design and review activities observed at NASA for risk identification, assessment, and management. The NASA life-cycle is centered around the experience of its scientists, engineers, and managers, and the checks and balances are instituted primarily through a number of programmatic and technical reviews, but also through workshops and design tools. This paper explores project development at both the full mission level and on an accelerated design project to demonstrate the range of methods used to identify and manage risk in NASA missions. With a better understanding of not only the execution of but also the motivation for current development life-cycle practices, organizations, including but not limited to NASA, can better improve and error-proof their design processes through risk management.

KEYWORDS: design, review, risk, management, assess, NASA

TABLE OF CONTENTS

1. BACKGROUND
2. STUDIES OF DESIGN AND REVIEW AT NASA
3. CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES
BIOGRAPHY

1. BACKGROUND

Motivation

The National Aeronautics and Space Administration (NASA) has applied design principles with peer reviews and periodic systems design reviews for high reliability aerospace design. The successes and failures of NASA missions have provided lessons learned for the organization’s design review practices. Events of recent years have shown that the design process at NASA, as with other organizations, needs to be continually improved. Missions like the Mars Climate Orbiter showed that reviews could still let simple design errors such as unit mismatches slip through and cause navigation failures. The goal of this paper is to support understanding of design process error-proofing research through discussion of current design and review practices.

Related Research

A broad definition of risk is that risks are issues which can impact a person’s or organization’s ability to meet their objectives. From a decision-analytic viewpoint, risk is the product of the probability and the cost of injury, damage, or loss. Risk analysis and assessment is common practice particularly in safety-critical industries like aerospace and health care, but also in large-scale operations and operational decision making, and is expanding to other areas including software, environmental hazards from natural disasters, and homeland security (Beroggi and Wallace 1994). In the analysis, the key questions involve identifying the location and assessing the significance of the risk. Assessments must typically include technical (feature), schedule (time), and financial (cost) considerations (Sage 1995). Typically, these are done through point estimates, range estimates, and what-if scenarios [1].
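As a minimal formalization of the decision-analytic view above (risk as the product of probability and cost), the exposure of a single risk and the total over a set of identified risks can be written as

\[ R_i = p_i \, c_i, \qquad R_{\text{total}} = \sum_i p_i \, c_i , \]

where \(p_i\) is the estimated probability of the i-th loss event and \(c_i\) its cost in dollars, schedule, or performance units; the notation is generic and not taken from the cited references.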

In the context of design, Sarbacker (1998) identified three axes in the framework of risk:

1. Envisioning risk: Will a targeted product with the targeted product attributes create value for the customer and company?

2. Design risk: Does the product as designed embody the targeted product attributes and create value?

3. Execution risk: Can the development team deliver the product as-designed?

Typically, risk management on these three axes is a continuous process, illustrated in Figure 1. For example, the SEI risk management paradigm is an elaboration of the classic “plan-do-check-act” cycle of project management which includes the identification of risks and the communication of these issues and concerns (Williams et al. 1997).


Fig. 1. CONTINUOUS RISK MANAGEMENT
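The loop in Figure 1 can be pictured with a short sketch like the following; the step names follow the SEI paradigm as it is commonly described, and the function and field names are illustrative assumptions rather than any NASA or SEI tool's API.

```python
# Minimal sketch of a continuous risk management loop in the spirit of the SEI
# paradigm (identify, analyze, plan, track, control, with communication
# throughout). Names and fields are illustrative, not taken from any NASA tool.
from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    likelihood: float = 0.0   # rough probability estimate, 0..1
    impact: float = 0.0       # consequence cost, arbitrary units
    mitigation: str = ""

def identify(concerns):
    return [Risk(description=c) for c in concerns]

def analyze(risk, likelihood, impact):
    risk.likelihood, risk.impact = likelihood, impact

def plan(risk, mitigation):
    risk.mitigation = mitigation

def track_and_control(risks):
    # Re-rank by exposure on every pass; reviews feed updated estimates back in.
    return sorted(risks, key=lambda r: r.likelihood * r.impact, reverse=True)

def communicate(risks):
    for r in risks:
        print(f"{r.description}: exposure={r.likelihood * r.impact:.1f}, plan='{r.mitigation}'")

risks = identify(["unit mismatch at a software interface", "late delivery of a subsystem"])
analyze(risks[0], likelihood=0.1, impact=100.0)
analyze(risks[1], likelihood=0.3, impact=20.0)
plan(risks[0], "independent peer review of interface units")
communicate(track_and_control(risks))
```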

Historically, formal and systematic risk management has not always been implemented for all NASA projects, though it has been applied to varying degrees (Fisher et al. 2002). Figure 2 shows the typical risk assessment approach taken at NASA as reported in the requirements document NPR 7120.5B.

Fig. 2. NASA NPR7120.5B RISK APPROACH

Previous work at NASA includes Rose’s (2002) assessment of risk management practices at JPL, where he found much progress since 1995 and the use of qualitative criteria to assess risk but still a lack of a risk management culture. The cases in this paper explore different methods and tools currently used within NASA in the assessment of risk on different levels and scope from organizational to subsystem. The goal of this research is to identify actions which can solve these risks at a higher level in a more robust manner, as illustrated in Figure 3 (Chao and Ishii 2003).

Solution levels: Level 0: Denial phase, rationalize without fixing. Level 1: Fix the specific problem, very reactive. Level 2: Fix the process. Level 3: Fix the system.

Robustness levels: I. Tool: General aid to analysis and design. II. Improvement: Improvement to simplify or guide the process. III. Inspection: Design review or inspection of the system. IV. Detection: Detect the error immediately after being made. V. Prevention: Eliminate the possibility of performing an erring action.

Fig. 3. SOLUTION AND ROBUSTNESS LEVELS

NASA Review Practices

NASA has a well-established, organization-wide life-cycle. Like many organizations, NASA uses phases as a means to organize decision points. Requirements definition begins in phase A, with refinements and baselining occurring in phase B. Lower-level requirements are derived between phases B and C, and major requirement definition is completed for all levels by phase C. Design reviews are held at key transition points along this life-cycle. All NASA missions and spacecraft are subject to a technical design review process. The Technical Design Review Program consists of a subset of such system reviews, depending on whether the project is a spacecraft or instrument, and new or follow-up. There are a number of system reviews which are performed throughout the life-cycle. For formal reviews, there are a number of guides and documents to help projects through the review process. A standard review checklist is used as an aid to review planning. In addition to these programmatic, system-level reviews, a number of technical engineering peer reviews are key at NASA. These reviews serve as pre-reviews to ensure success in the formal review (Chao et al., 2004). Jet Propulsion Laboratory’s Project Review document (D-10401) defines a peer review as a working-level, in-depth review convened as needed to evaluate an analysis, concept, design, product, or process thoroughly.

In the NASA life-cycle, the two key reviews are the PDR and CDR. The Preliminary Design Review (PDR) is the first major review of the detailed design and is normally held prior to the preparation of formal design drawings. PDRs are conducted to confirm that the approach for the system's design is ready to proceed into the detailed design phase. A PDR is held when the design is advanced sufficiently to begin some testing and fabrication of design models. Detail designs are not expected at this time, but system engineering, resource allocations, and design analyses are required to demonstrate compliance with requirements. The Critical Design Review (CDR) is held near the completion of an engineering model, if applicable, or at the end of the breadboard development stage. This should be prior to any design freeze and before any significant fabrication activity begins. The CDR should represent a complete and comprehensive presentation of the entire design. CDRs are conducted to demonstrate that the detailed design is complete and ready to proceed with coding, fabrication, assembly, and integration efforts.
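For orientation, the relationship between the phased life-cycle and the reviews named above can be captured in a simple lookup table; the sketch below paraphrases the narrative and is an assumption for illustration, not an official NASA review schedule.

```python
# Illustrative placement of the reviews discussed in the text within the phased
# life-cycle; a paraphrase of the narrative, not an official NASA schedule.
LIFE_CYCLE_REVIEWS = {
    "Phase A/B (definition, requirements baselining)": [
        "engineering peer reviews", "System Requirements Review (SRR)",
        "Preliminary Design Review (PDR)"],
    "Phase C (detailed design)": [
        "engineering peer reviews", "Critical Design Review (CDR)"],
    "Phase D (development)": ["engineering peer reviews"],
}

def reviews_for(phase: str) -> list[str]:
    """Return the reviews associated with a life-cycle phase in this sketch."""
    return LIFE_CYCLE_REVIEWS.get(phase, [])

print(reviews_for("Phase C (detailed design)"))
```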

Risk Tools

In addition to reviews, there are a number of risk management tools used across NASA. In a task for the Integrated Design Capability (IDC) at NASA Goddard, Reuss (2003) performed a survey of risk management-related tools for application in collaborative, rapid design environments and pre-formulation/formulation phase projects, particularly for identification, classification, and sub-system/system-level reporting versus management. The goal was to support many studies with large variation in mission type and architecture. The tools reviewed were:

Active Risk Manager (ARM): commercially available software which performed complete life-cycle risk management

Defect Detection & Prevention (DDP): life-cycle risk balancing via what-if mitigation scenarios


Project Risk Information Management Exchange (PRIMX): basic life-cycle risk management through identification, prioritization, summarization, and tracking

Risk Assessment Project (RAP): basic life-cycle risk management through prioritization on a color-coded chart

Riskometer: questionnaire-based, individual risk rating

System Risk Management Database (SRMD): risk and related information storage, including lists, mitigation assignments, and information sheets

Worry Generator: question-based risk listings to help identify risks

Though there were many risk-related tools, Reuss only considered risk management tools used for decision analysis. The RAP and Worry Generator methods in particular are described in the studies below.

2. STUDIES OF DESIGN AND REVIEW AT NASA

Programmatic Review

Kepler Mission — The Kepler Mission is a special purpose space mission in the NASA Headquarters Discovery Program. The goal is to search for terrestrial planets and identify the chances that there are others in the universe that support life. From Earth, more than 100 planets have already been found. Kepler will help understand how planetary systems form and the variety of planetary systems in our galaxy. The project started in January of 2001 and is scheduled to launch in October of 2007. Over 50% of the life-cycle is to be spent in the definition phase (A-B), just under 20% in design (C), and the final 30% in the development (D) phase. The Kepler team involves many organizations, including NASA Ames Research Center (science and mission lead), the Smithsonian Astrophysical Observatory, UT Austin, UC Berkeley, the SETI Institute (ground-based observations), the Space Telescope Science Institute (data management), the Deep Space Mission System (data capture), Ball Aerospace (engineering and mission operations), and JPL (program management).

Fig. 4. KEPLER SPACECRAFT

Ground Segment Peer Review — The Kepler Mission Ground Segment Peer Review was held on June 26, 2003 in the Space Projects Facility at Ames. The review was held in a large conference room with about 35 people present; the five members of the review board and key members of the team sat in the center of the room, with rows of chairs along the sides. The review included experts in operations and data systems from across the country, including members from Caltech, different NASA sites like Ames and JPL, as well as partners and contractors like Honeywell and Ball.

Starting at 8AM, the peer review centered around the Ground Segment of the Kepler Mission. Essentially, this is everything that is not in the Flight Segment (what is in orbit) or the Launch Segment (what puts things into orbit). It includes the scientific activities, scientific data processing, and the science team. This peer review was a pre-review for the SRR (System Requirements Review) which would be held in a few months. This was meant to be a “shirtsleeve” review where people could sit down and “roll up their sleeves” to discuss the topics in a fairly informal manner.

The presentation slides were completely printed and available to the attendees. Requests for Action (RFA) forms were also made available for the review team to make specific recommendations. The session began with introductions of the format, agenda, and of the review board and several key personnel present. The goal was to spend the first few hours overviewing the project, with the review board directing the speed and detail of the presentation. In the morning, the organization of the teams and roles and responsibilities was first discussed. Next, there were discussions of how data was captured, processed, and archived, as well as other elements of the project like documentation and operations.

The first half of the review was quite interactive, with the review board members and others asking questions often during the presentations. The chairman tried to ensure that the presentation stuck to the schedule, and in fact, the presentations ended slightly early for lunch. There didn’t appear to be any guidelines or checklists used to direct the review. The presenter would answer the questions or direct them to other members of the Ground Segment team. Most questions seemed straightforward and could be answered on the spot. Most of the presentations and questions were not that detailed technically, as many of the things that were presented were still in progress and not final. The atmosphere was always fairly relaxed and light, with people unafraid of making jokes throughout.

The afternoon sessions were organized around two sessions: Mission Operations System (MOS) and Ground Data System (GDS). There was the thought of splitting the team and board into two groups to discuss each session separately in more detail, but the team wished to have all the eyes on both presentations to get more perspectives and viewpoints on both. The afternoon presentations were more technically detailed and presented the architecture and approach of the systems, particularly the flow of data. The topics centered on operations, software modules, and controls, but not really hardware or component engineering. Cost was not a major topic, though scheduling was. The major question from the board involved the requirements flowdown of the mission. In response to the board’s questions, a final presentation was added that centered on the requirements and the documents that support them.

In the second half, the questions by both the review board and different team members were more specific and direct, as were the criticisms and recommendations. The responses to the questions often developed into longer dialogues allowing both sides to explore the issue deeper. In addition to direct recommendations, the reviewers would also give “war stories” of things that could go wrong and things they should do to prevent them in the different contexts. The conversation was still very interactive, and the atmosphere was still one of mutual learning, rather than adversarial. When material was presented that wasn’t deemed relevant, the review board simply said so.

The board concluded the review by highlighting observations they had made throughout the day and going through the RFAs. Each reviewer went through his or her notes and discussed the several specific recommendations or comments point by point. There wasn’t much direct feedback as to the quality of the presentations or answers during the presentation, and certainly no disparaging remarks, but at the end, each praised the efforts of the team highly. The team apparently made a more formal and complete presentation than was expected.

Observations — The entire peer review process went very smoothly. The atmosphere was very professional and an environment of mutual learning. Both sides were prepared and knew what they were talking about. The chairman did not need to step in much to run the review, as it propagated itself quite well. People were not only familiar with the topic but also spoke a common language, largely familiar with all the terminology, though this probably makes the review more difficult for an outsider to follow. The reviewers did seem to be seeing the material for the first time, and the presenters knew their material well and didn’t have issues going over their time allocation. They also knew whom to direct questions to, or their colleagues knew when to step in and help answer details. The questions seemed very relevant, and the answers were direct and succinct. The discussions that ensued seemed very productive.

Technical Peer Review

Engineering Peer Meetings — In July 2003, interviews with the Project Manager (PM) of the ST-6 project explored a case study of technical review methods at JPL. At the NASA Jet Propulsion Laboratory, one project manager implemented an “engineering peer meeting” system for the informal reviews. The method places a priority on these reviews and plans for them from the beginning, scheduling them for about a week before the formal review. The reviews consist of about two reviewers and two engineers who work in a roundtable format, usually in the engineer’s office. Each review begins with a brainstorming session and continues with a pre-generated template of questions, called “worry generators.”

The technical peer reviews that support teams’ technical work before programmatic reviews are often informal, ad hoc, and inconsistent across the organization. While the reviews did have the proper elements in terms of agenda and expertise, the PM didn’t like the psychology of the reviews. He felt they were conducted like a trial, with opposing prosecution and defense looking out for their own interests and not necessarily working together to look for weaknesses in the design. One side is trying to look smart by defending their work, while the other side is trying to look smart by attacking it.

In NASA, the engineers and managers are asked to have reviews very early and very often. The purpose of these reviews is largely to ensure that the team does not go in the wrong direction. However, these early reviews often occur before the team is really ready to “lock in” on ideas or numbers, but they feel the need to produce viewgraphs with their preliminary thoughts and sell them as a more mature idea. And once something is in writing and on a viewgraph, it gets notoriety. These estimates can be based on completely unrealistic assumptions. Costs are particularly dangerous, as it is very hard to increase that amount, even if the initial guess is much too low. Formal reviews can even act as a barricade. The key to doing the reviews early is to keep them informal and use peer reviews rather than the formal, mandated reviews. These “peer meetings” need to be conducted in the right way. Both types of reviews are necessary. It is important, though, to structure them so they complement each other.

Prioritization — It is important to place a priority on peer reviews and plan for them from the very beginning. Though ideally peer reviews would be done on every subsystem and component in the greatest of detail with all the foremost experts, that is not always possible. Likely, the team needs to do some sort of review on every subsystem but spend different amounts of time and resources on different subsystems. For example, some programs may be more hardware or software oriented. It is up to the project manager and the subsystem leader to look at areas where there are problems. Software can often have overruns, and in addition, it is often difficult to do a formal review on it. As such, peer reviews are great for software. Peer reviews are also essential when the team lacks experience with the type of project or technology. Before the peer review begins, it is important to identify what is needed out of the review. Is it a “rubber stamp” process? Or is the goal just to gather as many smart people as possible to share general comments? The review can try to understand the process and identify things that can go wrong (insight), or it can just be a check to make sure the process has been done (oversight).

Scheduling — As pre-reviews, a good rule of thumb is to conduct the peer reviews about a week or 10 days before the formal review. This gives enough time to react to suggestions and criticisms. Doing a review too early, even as much as a month before, likely means reviewing something that no longer reflects the true status by the time the formal review comes. However, the planning of these peer discussions must start well in advance. It can take up to six months to get a good handle on a medium-size mission: a month to figure out good questions to ask, a month to collect questions about similar subsystems in the past, a month to find out who to talk to, two months to conduct the interviews, and another month to put the data into something reasonable and use probabilistic risk tools. Nonetheless, the review process is a continuous process. Each review likely uncovers new problems and continually changes the risk posture of the project, and future reviews may be needed.

Personnel — Peer reviews are extremely reliant on the skills of both the presenters and the reviewers. Not only must they be technically sound to understand the issues, but they also need the verbal and communicative skills to explain the background and analysis. It is up to the program manager to decide who to include in reviews. Currently there is no organized system or list of reviewers in place for managers to refer to. The process is quite informal but intuitive in many ways. The main constraint is to choose people who are not working on the same project. For example, if the subsystem is in electronics, reviewers should be people who are in electronics, like former cognizant engineers. When experts are needed from other areas, often the best place to start is a manager of that section or line organization.

There have been differing views on the number of reviewers necessary. One school of thought is that the more reviewers, the better, as there are more “sets of eyes” to spot problems. However, one project manager felt very strongly that smaller sets of reviewers are better as they allow more personal interaction. In addition, a smaller peer review can be held in an engineer’s office where he or she has full access to any materials needed and allows them, for example, to look over a piece of paper and make marks directly on it together. This project manager said in his experience, the best “peer meetings” are with two reviewers to one or two team members. In addition, he has the reviewers talk with him informally afterwards to discuss next steps. If there are three or more reviewers, then not only does it limit the dialogue, but it makes the presentation process more formal. With only a few people, the team can just look over the same sheet of paper and have a more intimate dialogue.

Peer review discussions are usually driven by the design team presenting areas they are not comfortable or confident with. Whenever possible, it is important for the reviewees to recognize the issues and invite reviewers who are familiar and experienced in those areas. In addition, it is better if the project manager is not present, as this allows a more relaxed atmosphere. The reviewers can simply chat with the project manager informally to update him on the process. The value of peer reviews is not in passing or failing them; it is to prepare for the formal review. The PM has said that some of his best formal reviews have come after failing several peer reviews.

Format — It is important to change the way people come into these peer reviews. The format is important, and should emphasize that these aren’t official, formal peer reviews. It is better to just have them in an engineer’s office. If it’s in a conference room, not only does it make it difficult to find and look up material, but it also affects the psychology and the formality of the exchange.

Even the seating arrangements have psychological impact in peer reviews. By putting the review in a round-table format and not having people stand up and present the material, it prevents the exchange from being too formal or adversarial with a prosecution-defense mentality. The PM recommends having only three, perhaps four, members at a time in peer meetings. From his experience, the best arrangements include one or two team members with two peer reviewers. If there are three or more reviewers, then not only does it limit the dialogue, but it is also necessary to make copies of the papers, which makes the process more formal. With only a few people, the team can just look over the same sheet of paper and have a more intimate dialogue.

One of the things that makes peer reviews under this PM’s projects unique is the brainstorming session that starts each peer meeting. He aims to make the experience different in many psychological aspects. The peer reviews begin with the participants in the room just chatting about the problem and the issues for a while without writing anything down. This allows the conversation to be more free-flowing and also lets the participants be more relaxed about talking about issues without worrying about details that they will be held accountable for later. The PM has actually tested this technique on different subsystems and found that teams often stay with numbers even if they are not necessarily accurate. When the reviews required the teams to write down numbers, the group often “forced” the numbers and they evolved only 30%, while with teams that weren’t allowed to write things down, their numbers evolved by as much as 400%. When talking about costs, it is important to ask open instead of leading questions. The initial cost conversations should be one-on-one, and should start with discussions of the maximum cost figure before the nominal or minimum.

Risks and Worry Generators — The key to these peer meetings is the list of “worry generators.” These are templates of types of risk and questions that reviewers should be concerned about, and are generated from historical and lessons learned sites. Most reviewers don’t prepare anything before their reviews. Because of the formality of the mandated activities for reviews, in addition to the stigma of failure at any organization, it is often difficult to find documentation which talks about failures. Euphemistic language that doesn’t help discuss how to catch errors is often prevalent in reports.

Worry generators include topics such as costs and non-technical issues which are often not discussed in detail by engineers. This guidance helped formalize a normally weak though important process. There are worry generators for each subsystem, with separate ones for software and hardware. However, before each review, it is up to the project manager or subsection leader to review relevant lessons learned and past failure history documents to find specific relevant areas that should be discussed in the peer reviews. For example, power generator issues are likely common for different missions.

Worry generators also include topics that engineers don’t like to talk about, such as costs and non-technical issues. Engineers often feel above programmatic issues and wouldn’t talk about such topics in their presentations. Just because issues aren’t technical doesn’t mean they aren’t relevant. It is necessary to worry about even environmental issues, like the allocations needed in reserve if a contractor on the east coast is snowed out. The worry generators are used to carry the discussion forward into the group brainstorming. Examples of worry generators include the following (a sketch of how such templates might be organized appears after the list):

What can go wrong? What would be the source(s) of the problem? What are the chances of that happening? How certain am I of that?

Do I really know what stresses this process and how this process will respond over its full range of loading?

What are the triggers that might break the process? What signals that it is broken, when, and to whom?

If I were trying to make this process/system/step NOT work, what things could I do?

What things in the environment outside of my area of focus do I not know about fully, and which of them might interact with my area? What would be the results? How would I know?

How ready is the technology to be used? How do I know that? For how long will the technology be usable? Why? What limits the time? How many times will I have to turn the technology over in a 10-15 year period?

Are there any dicey design issues I can foresee? Any test and integration challenges?

How much is the human part of the process affected? How much training, re-staffing, and overcoming of resistance will there be? Why? How certain am I?

Can everything needed be manufactured, delivered, tested, and made operational in time? What happens if not? How likely is that?

Are there any things that could be unsafe or hazardous? How? Under what conditions?

What could one do about each of these things to:
− avoid the risk (e.g., change requirements, other options)?
− transfer the risk (sell tasks, insure)?
− abate the consequences (reduce the impact; provide back-up; insulate/isolate)?
− reduce the likelihood of occurrence?
− keep track of the situation, watching for triggers and leading indicators?

Is this big enough to do something about, keep track of and worry about, or accept and ignore?

How could we get more information to narrow the uncertainty of what we know about these potential risks?
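A minimal sketch of how such worry-generator templates might be organized, with generic questions plus subsystem-specific hardware and software variants, is shown below; the data structure and the condensed question wording are assumptions for illustration, not the actual JPL templates.

```python
# Hypothetical organization of worry-generator templates: generic questions plus
# subsystem-specific ones, with separate hardware and software variants. The
# structure and condensed wording are illustrative, not the actual JPL files.
GENERIC_WORRIES = [
    "What can go wrong, what is the source, and how likely is it?",
    "What triggers might break the process, and who sees the signal?",
    "Can everything be manufactured, delivered, tested, and made operational in time?",
    "Is this big enough to act on, track, or accept and ignore?",
]

SUBSYSTEM_WORRIES = {
    ("power", "hardware"): ["How ready is the technology, and how long will it be usable?"],
    ("ground data system", "software"): ["Any dicey design issues or test and integration challenges?"],
}

def worry_list(subsystem: str, discipline: str) -> list[str]:
    """Assemble the question template a reviewer might bring to a peer meeting."""
    return GENERIC_WORRIES + SUBSYSTEM_WORRIES.get((subsystem, discipline), [])

for question in worry_list("power", "hardware"):
    print("-", question)
```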

The PM also has a well-defined and established process for dealing with risks. Often engineers find the whole process a waste of time. When it comes to risk management and cost estimation, he hires a person full-time or half-time to collect data and enter it into “risk lists” for the whole project. This individual goes from office to office prodding people for data. He or she polls engineers to see what the maximum and nominal costs for each risk are and assigns probabilities. Using historical data and systems like MAIS (Mishaps and Anomaly Information System), the analysis covers not only the likelihood of the risks occurring but also the money required to mitigate them.
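The risk-list bookkeeping described here can be pictured with a short calculation like the one below; the field names and the probability-weighted roll-up are assumptions for illustration, not the PM's actual spreadsheet or the MAIS data model.

```python
# Illustrative risk list: engineers are polled for nominal and maximum mitigation
# costs, a probability of occurrence is assigned, and a probability-weighted
# exposure is rolled up. Field names and weighting are assumptions, not the tool.
from dataclasses import dataclass

@dataclass
class RiskEntry:
    name: str
    probability: float   # likelihood of occurrence, 0..1
    nominal_cost: float  # $K, engineer's nominal mitigation estimate
    maximum_cost: float  # $K, engineer's worst-case mitigation estimate

    def expected_cost(self) -> float:
        # Weight a simple average of the nominal and maximum estimates by likelihood.
        return self.probability * (self.nominal_cost + self.maximum_cost) / 2.0

risk_list = [
    RiskEntry("thruster valve qualification slips", 0.3, 150.0, 400.0),
    RiskEntry("flight software integration overrun", 0.5, 200.0, 600.0),
]

reserve = sum(entry.expected_cost() for entry in risk_list)
print(f"probability-weighted mitigation reserve: ${reserve:.0f}K")
```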

Conclusions — For this $25 million NASA program, there were over 100 reviewers at 14 reviews, costing almost $200K. Over 1000 charts were produced, along with 300 design principles and 20 versions of requirements documents. The project is still on-going, but so far the peer meeting system has worked well. The project manager has tested the “brainstorming” technique on different subsystems and found that teams often stay with numbers even if they are not necessarily accurate. When the reviews required the teams to write down numbers, the group often “forced” the numbers and they evolved only 30%, while with teams that weren’t allowed to write things down, their numbers evolved by as much as 400%. When talking about costs, it is important to ask open instead of leading questions. The initial cost conversations should be one-on-one, and should start with discussions of the maximum cost figure before the nominal or minimum.

Peer reviews often don’t discuss system interactions, as the system review takes place in the formal reviews. Usually these formal reviews have the leads of the peer reviews come and report. The PM tends not to discuss system interface issues in the peer reviews because the problem becomes much too big and costly for a peer review to handle. The topic can dominate the conversation and prevent the group from talking about the subsystem’s issues. The PM does advocate having a separate system peer review where members from the subsystem peer reviews attend. The participants report on the problems of current subsystems and discuss the system implications. There are also interface review documents which can aid this meeting. In addition, system issues are discussed at the weekly meetings with representatives from all the subsystems.

This PM believes the best way to improve the peer review process is to target project managers and create a guide on how to conduct reviews, beginning with a background on the psychology of reviews. That can be followed with a training class, probably no longer than 2 hours: the first half-hour session on the current state of design reviews, a second on peer review techniques, and a third on a retrospective example. General guides like these would likely be accepted at both the upper management and project management levels.

The PM believes design reviews can be dangerous in that they might give participants a false sense of security. This project manager, who has done much work on the psychology of risk, reminds reviewers to make corrections for personalities. If a person is optimistic in nature, his or her cost estimate will be optimistic as well. If the person is late to meetings, his or her schedule estimate will likely be underestimated. If the person has never worked on a badly overrun project, the next one likely will be. For most organizations, design reviews are the only line of defense against errors in the design process. Even if they are applied universally, they are still an imperfect gauge susceptible to human errors and will always miss some problems. Nonetheless, they will always be an essential part of any organization’s efforts to error-proof the development process. The key to using design reviews as part of the design process error-proofing toolkit is to recognize their inherent weaknesses.

Advanced Projects Design

Project Design Center — The JPL Project Design Center (PDC) is a facility which allows system-level design of instruments, spacecraft, mission operations, systems and missions in a concurrent engineering team environment. The primary benefits include a more thorough investigation of the design trade space in shortened design cycle times. The PDC was established in 1994 to improve the quality and reduce the cost of JPL proposals and advanced studies.

Four permanent design teams use the PDC to develop missions and instruments. Team X is the JPL Advanced Projects Design Team that provides proposal teams and advanced study teams with a full complement of space mission design expertise. Team I is the JPL Optical Instrument Development Team; it provides real-time support to optical instrument studies and provides geometry-based mechanical, thermal, and optical analyses for remote sensing. As opposed to Team X, Team I is less “intense” and more model-based. Team In-Situ was formed by the JPL In-Situ Center of Excellence to support detailed conceptual design studies of various science instruments and payloads, including rovers, drills, and life-detection experiments. Team G is the JPL Ground Operations Systems Design Team; it develops designs and cost estimates for all ground segment components and activities and performs system studies for ground, flight-ground, and end-to-end mission systems.

The PDC routinely employs real-time, distributed design sessions through file and screen sharing, video and audio conferencing, and interconnected real-time tools. The PDC has shown a threefold reduction in both time and cost and has increased the number of designs per year with improved accuracy and communication. In addition to saving travel time and money, it allows other teams and non-resident experts to participate in the design process. The success of the PDC has led to the establishment of a number of similar facilities in other NASA centers and industry.

Team X — The authors spent a week in July 2004 in Pasadena, California at the NASA Jet Propulsion Laboratory PDC observing Team X working on a Discovery-class orbiter mission that would take place 10 years down the road. The mission would orbit the moon of an outer planet in the solar system. The purpose of this Team X session was to do an internal JPL study, not to constitute any kind of commitment from NASA. In particular, this session would be a trade study to see what is feasible and what the impact of decisions would be. The approach is to balance three main considerations: cost, performance, and risk. The discussion was generally kept at a high level, although details were discussed when needed to clarify an issue or resolve a conflict. The strategy was to go into details only when necessary.

Team X work generally falls into two categories: examination of a design proposal or generation of a new design concept. Though several concepts may be evaluated, the end product of a session is typically a single design. The team supports JPL mission proposals and advanced studies and is the standing technical review board for all JPL mission proposals. Team X can quickly assemble a complete development plan, including cost, schedule, technical performance, and risk identification, usually completing about 65-85 mission designs and studies a year through a series of 3-hour sessions. Previous studies where they spent a full day showed that format to be ineffective, as people burned out.


Fig. 5. TEAM X LAYOUT AT THE PDC

The Team X session was held in the Project Design Center's Area A. The room had three large projection screens at the front, with a table and controls where the moderator could designate what appeared on each screen. There were five rows of about five computers each, perpendicular to the screens, with one "chair" seated at each computer. Each computer was labeled for the chair's domain, including Cost, Thermal, Power, Structures, Configuration, Programmatics, Software, Systems, Ground Systems (GS), Propulsion, Telecom (hardware and system), Visualization, Mission Design, Attitude Control System (ACS), Command and Data Handling (CDH), Instruments, and Science. One system lead moderated the entire session and led its flow.

The working hypothesis is that each Team X chair has a set of parameters that are generally relevant for any given mission. Evidence of this is seen in the “guideline sheets” displayed on the overhead by Systems and by the estimated 200 parameters already identified in the current technology models. These guideline sheets seem to list key parameters and their values or descriptions. The Work Breakdown Structure (WBS) is also home to key terms associated with both subsystems and different phases of the mission. These parameters seem to be organized according to subsystem. Each chair built models based on important parameters. After almost 600 sessions, most groups have some sort of legacy. Most were able to rely on libraries of previous experience to build models and analyses quickly. For example, the configurations graphics chair could build a model of the orbiter quickly by looking through previous libraries and adding and scaling them as necessary. While each team worked, in addition to sending final design parameters to the chair, each captured their work and observations in a Word document. The moderator would point out that the chairs should include any thoughts that might not be captured in the design.

The goal of this session was to do trades based on the requirements flowdown by freezing the performance and seeing the impact of cost decisions. The different chairs were linked electronically together by one Microsoft Excel spreadsheet which was continually shown on one or more of the main screens. Each of the chairs had a sheet which would link into it, and the main sheet could request information from the sub-sheets and vice versa. The spreadsheet tracked important mission metrics such as the mass, cost, power, total readiness level, and time last updated of the different subsystems. The goal was to bring together the costs, update them, review them together, and identify the major issues in order to maintain margins in not only cost but mass as well.
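The linked-spreadsheet arrangement can be approximated by a small publish-and-roll-up structure like the sketch below; the parameter names, allocations, and margin rule are illustrative assumptions, not the actual Team X workbook.

```python
# Rough sketch of the Team X bookkeeping: each chair publishes its subsystem
# parameters (mass, power, cost), and a systems roll-up reports remaining margin
# against allocations. Names and numbers are illustrative, not the real workbook.
import time

SHEET: dict[str, dict[str, float]] = {}   # chair name -> published parameters

def publish(chair: str, mass_kg: float, power_w: float, cost_m: float) -> None:
    SHEET[chair] = {"mass_kg": mass_kg, "power_w": power_w,
                    "cost_m": cost_m, "updated": time.time()}

def rollup(allocations: dict[str, float]) -> dict[str, float]:
    """Total each tracked metric across chairs and report the remaining margin."""
    metrics = ("mass_kg", "power_w", "cost_m")
    totals = {m: sum(params[m] for params in SHEET.values()) for m in metrics}
    return {m: allocations[m] - totals[m] for m in metrics}

publish("Propulsion", mass_kg=310.0, power_w=40.0, cost_m=35.0)
publish("Telecom", mass_kg=55.0, power_w=120.0, cost_m=22.0)
print(rollup({"mass_kg": 1200.0, "power_w": 900.0, "cost_m": 350.0}))
```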

The session started with most of the chairs being attentive and following the moderator, but as the session progressed, each of the sub-systems would have their own conversations and work on their analyses while the moderator talked. A few people would flow in and out of the room. Sometimes the moderator would ask individual chairs questions on the impact of specific tradeoffs, and the chairs would usually respond right away based on experience or gut instinct; similarly, the different chairs would speak up whenever the moderator said something they questioned. People were all fairly direct in their points, but the atmosphere remained friendly, as jokes were often made. The different chairs would often negotiate tradeoffs, such as in mass, power, and performance, out loud in public to determine what each subsystem needed.

Risk Assessment — The Risk Assessment Prototype (RAP) is original JPL software developed by Steve Cornford. It does basic life-cycle risk management by forcing communication between the risk originator (sub-system) and the rest of the system. It is still in a prototype phase and does not do tracking, but it does support the identification, prioritization, and summarization of risks and related information.

Fig. 6. RAP SAMPLE INPUT

A formalized risk assessment doesn't really come into play currently in the Team X analysis. However, a risk chair would frequently talk to the different chairs and ask them to identify their risk items and include them in a risk spreadsheet. All the risks were totalled and put on a "Fever Chart," a 5x5 FMECA-style two-dimensional table which plotted each risk element by likelihood and impact, color-coded by priority from green to yellow to red. The view of this risk tool held by the other chairs was that it comes perhaps too early in the design phase for Team X, which works in Phase A of the life-cycle, and that it may be better suited for Phase B or C. Risks in the Team X environment are called Risk Elements, a nonhomogeneous collection of phrases that includes failures, failure scenarios, events, causes, issues, and concerns.

Fig. 7. RAP FEVER CHART
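The 5x5 fever chart can be reproduced with a small binning function like the one below; the 1-5 scoring and the green/yellow/red thresholds are assumptions for illustration, since the actual RAP color rules were not recorded during the session.

```python
# Illustrative 5x5 "fever chart" binning: each risk element receives likelihood
# and impact scores from 1 to 5, and the cell color comes from a simple threshold
# on their product. The thresholds are assumptions, not the actual RAP rules.
def fever_cell(likelihood: int, impact: int) -> str:
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("scores must be between 1 and 5")
    score = likelihood * impact
    if score >= 15:
        return "red"
    if score >= 6:
        return "yellow"
    return "green"

risk_elements = {"late detector delivery": (4, 4), "ground antenna outage": (2, 3)}
for name, (likelihood, impact) in risk_elements.items():
    print(f"{name}: {fever_cell(likelihood, impact)}")
```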

Conclusions — After the team put together a final design concept, the moderator had Team X spend the final hour doing a very rough mass and cost rollup of a different idea: a simple blimp-type probe building on work from a previous study, to see if it is feasible and estimate its mass, though not at the same level of fidelity as the previous work. It appears that the documentation of risks and the use of RAP occur after a design decision has been made. Risks are taken into account through consensus among team members. The extent of individual risk management is unclear based on our observations. There is some inconsistency in how risks are treated individually; they need to be understood, documented, and correlated.

Lessons Learned Workshop

Office of Exploration Systems — On July 24, 2004, the NASA Office of Exploration Systems (OExS) held a "Lessons Learned Workshop" at Ames Research Center in Moffett Field, CA. The workshop included presentations on lessons learned from senior project managers and engineers, followed by discussion and brainstorming sessions. The purpose of the workshop appeared to be to make recommendations to Admiral Steidle as part of the "President's Vision for Space Exploration." Implicit in this workshop was the discussion of what changes are needed in NASA and even the role of the space agency in the future. The panel presentations were five 10-minute talks by different managers at NASA Ames, including directors, division chiefs, managers, and engineers from Flight Projects, Projects Management, the History Office, the Chief Engineer’s Office, and current NASA missions.

Several presentations referred to the change that the administration has undergone over the last 40 years, comparing not only projects and missions but even center directors. Most of the presentations centered around organizational issues. Some discussed project successes and what they attributed them to. Others discussed challenges in working with subcontractors and the balance between insight and oversight, and clarifying the roles of who is the customer versus the provider.

Of particular interest, the presentation by the Chief Engineer looked at technical lessons learned in the context of recent failures and events at NASA (Panontin 2004). Through specific examples and general lessons, she identified precursors and remediations for technical risks and failures. As shown in Table 1, the causes of additional risk and technical failure include complacency, being spoiled by success, ignoring outliers, inadequately accounting for changes, assumptions taken out of context, system interactions, and safety systems that, in excess, can actually reduce safety.

Table 1. LESSONS FROM THE AMES CHIEF ENGINEER

Cause: Process drift (complacency). Lesson: Successful, well-established teams tend to drift away from rigorous compliance with processes and procedures.

Cause: Probability and uncertainty. Lesson: Misunderstanding probabilities and underestimating uncertainties lead to success-engendered optimism.

Cause: Anomalies and outliers. Lesson: Project teams tend to discount the realism and relevance of unexpected or unusual system behavior, or address symptoms rather than cause.

Cause: Change. Lesson: Consequences of changes, whether unintentional or intentional, are overlooked.

Cause: Assumptions out of context. Lesson: Rationale behind previous decisions and key differences between current and previous conditions, and between test and actual environments, are not well explored and understood, leading to problems in reuse, proofs by similarity, and verification and qualification tests.

Cause: Systems of systems. Lesson: People tend to think in linear, static, single causal chains, while many systems behave non-linearly and dynamically, with many interrelated feedback chains.

Cause: Un-safety. Lesson: Safety systems, including redundancy and automation, can fail, present erroneous signals, and even reduce safety while providing a false sense of security.

In the ensuing discussion and brainstorming sessions, the group identified key lessons. The first was to establish and clarify program plans. The second recommendation was to empower project managers with the resources, authority, and accountability necessary to complete projects, and for NASA to protect them from irrelevant issues. Third, the financial accounting system is causing many problems with projects; in addition, too many projects are being underfunded, rather than the right set being chosen. Projects are also becoming increasingly risk averse, and the current risk management system is considered cumbersome. Also of interest were organizational risks and the role of NASA itself, which in the future may not necessarily be to further space exploration but rather to develop new systems engineers and technologists, who are currently in scarce supply.

Spiral Development — A related topic discussed at the end of the workshop was the usage of the Spiral Development Model. Spiral development began as a family of software development processes characterized by repeatedly iterating a set of elemental development processes and managing risk so it is actively being reduced. Groups like the Software Engineering Institute of Carnegie Mellon University and the Center for Software Engineering of the University of Southern California have sponsored workshops to understand and educate on this method.

Fig. 8. SPIRAL DEVELOPMENT MODEL

The spiral model can be used for cost-effective incremental commitment of funds. In addition, it includes such related practices as legacy system replacement and integration of commercial-off-the-shelf (COTS) components. An important and relatively recent innovation to the spiral model was the introduction of anchor point milestones. The question for NASA is how to relate pieces of a program, such as developing a solar array or a propulsion system, to the overall mission. It is necessary to further develop procedures and identify the proper go/no-go points in the life-cycle.

The Spiral Development Model features cyclic concurrent engineering, risk-driven determination of process and product, growing a system via risk-driven experimentation and elaboration, and lowering development cost by early elimination of nonviable alternatives and rework avoidance. As a result of planning and risk analysis, different projects may choose different processes. The spiral model is a process model generator in which different risk patterns can lead to choosing incremental, waterfall, evolutionary prototyping, or other subsets of the process elements.
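As a toy illustration of the "process model generator" idea, a project's dominant risk pattern could steer the choice of process elements as in the sketch below; the decision rules are invented for illustration and are not Boehm's actual criteria.

```python
# Toy illustration of the spiral model as a process generator: the dominant risk
# pattern steers the choice of process elements. The rules are invented for
# illustration and are not Boehm's actual anchor-point criteria.
def choose_process(requirements_stable: bool, technology_mature: bool) -> str:
    if requirements_stable and technology_mature:
        return "waterfall-like single pass"
    if requirements_stable and not technology_mature:
        return "incremental builds with early technology spirals"
    return "evolutionary prototyping with risk-driven anchor points"

print(choose_process(requirements_stable=False, technology_mature=False))
```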

Though it is still not a well-understood concept across most of NASA, the Spiral Development Model is being implemented in the “Project Constellation” timeline to iteratively evolve NASA’s vision for the nation in both unmanned and manned space vehicles, particularly for future Mars missions (2020+). Among those at NASA with experience with the model, the consensus was that it worked fairly well, but the challenge is that training takes time.

3. CONCLUSIONS

The programmatic and technical design and review activities in NASA’s life-cycle are used not only to generate concepts but also to understand and mitigate technical, schedule, financial, and organizational risks. Risks can result in failure if not properly identified and corrected for. Reviews and risk assessment tools are used across organizations like NASA but can be dependent on those who execute them. Robust solutions require earlier action at the system level and not merely the identification of particular risks. An important part of risk management is to plan risk mitigation and to understand that not every risk can be mitigated and every condition has many consequences.

Much future work is required to handle the risks associated with the complexities of organizations like NASA and the technologies they deal with. Though NASA had an Office of Reliability and Quality Assurance during the early 1960s, it disappeared by 1963, and the only type of safety program since has been a decentralized “loose federation” of risk assessment oversight run by each program and project office. In the aftermath of the Columbia accident, the investigation board found that hazard reports and risk analyses are rarely communicated effectively, nor are the many databases used by engineers and managers capable of translating operational experiences into effective risk management practices.

NASA’s future collaboration with Stanford University will primarily be centered on risk and reliability assessment from a system viewpoint, including hardware and software interfaces and supply chain issues. In addition, NASA Ames Research Center has worked closely with the University of Missouri at Rolla in understanding function mapping with failure data and the development of software tools to implement common design repositories for NASA design teams.

ACKNOWLEDGEMENTS

We sincerely appreciate the assistance of Mike Van Wie, Matt Bohm, and Rob Stone and the cooperation of Chet Borden, Art Chmielewski, Steve Cornford, Leila Meshkat, Tina Panontin, Steve Prusha, Claire Smith, Bob Umberto, and Larry Webster. Special thanks to the Engineering for Complex Systems Program at NASA Ames Research Center and Mission Critical Technologies.

REFERENCES

[1] Beroggi, G.E., and Wallace, W.A., 1994, “Operational risk management: a new paradigm for decision making,” IEEE Transactions on Systems, Man and Cybernetics, Vol. 24, No. 10, October, 1450-1457.

[2] Boehm, B., and Hansen, W.J., 2000, “Spiral Development: Experience, Principles, and Refinements,” Spiral Development Workshop, Special Report: CMU/SEI-2000-SR-008, February 9.

[3] Chao, L.P., and Ishii, K. 2004, "Design Process Error-Proofing: Project Quality Function Deployment," Proceedings of the ASME DETC: DFM, Salt Lake City, UT.

[4] Chao, L.P., and Ishii, K., 2003, "Design Process Error-Proofing: Developing Automated Error-Proofing Information Systems," Proceedings of the ASME DETC: DAC, Chicago, IL.

[5] Chao, L.P., Beiter, K., and Ishii, K. 2001. "Design Process Error-Proofing: International Industry Survey and Research Roadmap," Proceedings of the ASME DETC: DFM, Pittsburgh, PA.

[6] Chao, L.P., Tumer, I., and Ishii, K. 2004, "Design Process Error-Proofing: Engineering Peer Review Lessons from NASA," Proceedings of the ASME DETC: DFM, Salt Lake City, UT.

[7] Chmielewski, A.B. “Psychology of Risk Management: The ST6 Experience.” NASA New Millennium Program.

[8] Fisher, K., Greanias, G., Rose, J., and Dumas, R., 2002, “Risk management tools for complex project organizations,” IEEE Aerospace Conference Proceedings, Volume 2, 9-16 March, 721-727.

[9] Gehman, H.W., et al., 2003, “Columbia Accident Investigation Board Report,” National Aeronautics and Space Administration, Volume I, August.

[10] Huber, T.E., ed. 1992. “The NASA Mission Design Process.” NASA Engineering Management Council. 22 Dec. 1992.

[11] Meshkat, L., Cornford, S., and Moran, T. 2003. “Risk Based Design Tool for Space Exploration Missions.” Proceedings of the American Institute of Aeronautics and Astronautics (AIAA) Space 2003. Long Beach, CA.

[12] Panontin, T., 2004, “OExS Lessons Learned Workshop,” NASA Exploration Systems Office Lesson Learned Workshop, NASA Ames Research Center, Moffett Field, CA, July 28.

[13] Quinn, J., 1994, “Flight P/FR’s and the Design Review Process.” JPL. D-11381.

[14] Reuss, L.M., 2003, “Risk Management Tool Survey,” SAIC/NASA Goddard report, September 5.

[15] Rose, J.R., 2002, “Risk management at JPL - practices and promises,” IEEE Aerospace Conference Proceedings, Volume 2, March 9-16, 641-649.

[16] Rose, J., 2003, “Project Reviews (D-10401), Rev. B.” JPL Rules! DocID 35163. April 23.

[17] Sage, A.P., 1995, “Risk management systems engineering,” IEEE International Conference on Systems, Man and Cybernetics: Intelligent Systems for the 21st Century, Volume 2, 22-25 Oct., 1033-1038.

[18] Williams, R.C., Walker, J.A., and Dorofee, A.J., 1997, “Putting risk management into practice,” IEEE Software, Volume 14, Issue 3, May-June, 75-82.

[A] ASC Engineering Directorate – Integrated Risk Management, http://engineering.wpafb.af.mil/risk/risk.asp

[B] NPR 7120.5B, http://nodis.gsfc.nasa.gov/displayAll.cfm?Internal_ID=N_PR_7120_005B_&page_name=ALL

[C] Active Risk Manager (ARM), http://www.strategicthought.com/stl/risk/arm.asp?section=risk&pagename=riskarm

[D] Defect Detection & Prevention (DDP), http://ddptool.jpl.nasa.gov

[E] Project Risk Information Management Exchange (PRIMX), http://www.swales.com

[F] Riskometer, http://www.futron.com/riskmanagement/tools/Riskometer%20FREEBIE.xls

[G] System Risk Management Database, http://osat-ext.grc.nasa.gov/rmo/riskdb/

[H] Worry Generator, http://www.futron.com/riskmanagement/tools/Worry%20Generator.doc

BIOGRAPHY

Lawrence Chao earned his BS in Mechanical Engineering at the Massachusetts Institute of Technology and MS in Mechanical Engineering Design at Stanford University. He has worked at and with organizations including General Electric, ABB, NASA, and General Motors. He is currently a Ph.D. candidate and research assistant with the Manufacturing Modeling Laboratory of Stanford University's Department of Mechanical Engineering, Design Division, under Professor Kosuke Ishii. His research focus is on "design process error-proofing," developing tools, methods, and strategies to understand, predict, and prevent product development errors.

Dr. Irem Y. Tumer has been with the Computational Sciences Division at NASA Ames Research Center since 1998. Her research interests focus on formal design methods, risk-based design, risk mitigation, and failure analysis and fault detection systems, with the overall goal of improving the state-of-the-art in designing reliable and robust NASA mission-enabling systems. She is a Level 3 project manager, managing projects ranging from risk characterization and optimization to model-based risk analysis. She received her PhD in Mechanical Systems & Design from The University of Texas at Austin.

Dr. Kosuke Ishii earned his BSME at Sophia University, Tokyo, MSME at Stanford University, and Masters in Control Engineering at Tokyo Institute of Technology. After serving Toshiba Corporation as a design engineer, he returned to Stanford and completed his PhD in Mechanical Design. He currently holds the rank of full professor at Stanford University, serves as the director of the Manufacturing Modeling Laboratory, and focuses his research on structured product development methods, commonly known as "Design for X."