Electronic Device Failure Analysis, Volume 16 No. 4

EDFAAO (2014) 4:4-12 1537-0755/$19.00 ©ASM International®

Avoiding FA Mistakes

Failure Analysis: Why Mistakes Are Made and How to Avoid Making One

David Burgess, Accelerated Analysis [email protected]

Introduction

A good starting point for this discussion is a typical generic definition of the term failure analysis offered by Wikipedia: “Failure analysis is the process of collecting and analyzing data to determine the cause of a failure.” Once the cause of failure is known, steps can be taken to avoid subsequent failures by eliminating that cause.

It follows from the definition that identifying an incorrect cause is the biggest failure analysis mistake possible. Corrective action focused in the wrong direction is doomed to be ineffective. No progress can be made until the mistake is realized and the correct cause is identified. The net result of a faulty analysis is negative, not positive. Time is wasted as the problem grows and becomes more complicated.

Failure to identify any cause at all is not catastrophic, but without an accurate cause, precise corrective action is not likely. The best options available are loosely targeted and speculative.

The last few decades have brought massive changes in semiconductor failure analysis. Most changes are in the area of collecting data from the sample. All of the new diagnostic tools are required for an increasingly complex technology. Improved imaging, deprocessing techniques, and material analysis tools are essential.

In contrast to impressive progress made in physical failure analysis, little attention has been given to the basic failure analysis procedure. In fact, concentration on physical failure analysis may have diverted attention from the basics. Mistakes still occur all too often.

Why Mistakes Happen

There may be many reasons why an incorrect cause or no cause at all is identified by failure analysis. Poor understanding of basic failure analysis terms and concepts is part of the problem. The examples that follow show how easily thinking can be misdirected.

Another factor is that routine failure analysis procedures often neglect to establish the circumstances of failure and its related history. For cases where that missing information is key, physical analysis alone cannot lead to the correct cause. The problem remains unresolved until extraordinary steps are taken to collect the information that defines the problem.

Basic FA Concepts

Failure analysts must think clearly and precisely. The language in our heads is important. Differing meanings of critical terms can cloud our thinking. Important examples include the terms failure, cause, root cause, failure mode, and failure mechanism.

Failure. In the definition above, failure refers to the failure event. (The product stopped working.) For convenience, the word failure is sometimes loosely used to refer to the device assumed to have failed.

Cause. In the definition above, cause refers to the root cause. The root cause is the key to the failure analysis problem. We also use the word cause to refer to smaller events. For example, “Contaminated water caused exposed metal to corrode.” However, contaminated water is (possibly) always present. What caused the metal to be exposed?

Failure Mode. Failure mode is the characteristic that resulted in the sample being labeled as a failure. High leakage is one example of a failure mode. For failure analysis, we often make many more measurements and often choose a failing parameter that is different from the actual failure mode but easier to track. Assuming that the easier parameter tracks the actual failure mode is normally safe.

Failure Mechanism. The cause-and-effect progression of a failure from beginning to end is the failure mechanism. Second breakdown is a good example of a failure mechanism. Electric current flows in a semiconductor junction biased near or in avalanche breakdown. Unequal current flow causes nonuniform temperature in the junction. Resistivity decreases with temperature, so the highest temperature location draws increasingly more current. Rapidly, current becomes concentrated in one filament. At high temperature, dopants diffuse and silicon may melt.
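The positive feedback in this description can be illustrated with a toy lumped model. The sketch below is not a device simulation: it assumes a handful of parallel conduction paths sharing a fixed total current, a conductance that rises exponentially with temperature, and a simple thermal resistance and heat capacity per path. Every parameter value, and the function name `simulate_current_hogging`, is an assumption chosen only to make the feedback visible.

```python
import math

def simulate_current_hogging(n_paths=4, total_current=20.0, steps=5000, dt=0.01):
    """Toy model: parallel paths share a fixed total current (all values assumed)."""
    alpha = 0.02      # assumed fractional conductance rise per kelvin
    r_thermal = 20.0  # assumed thermal resistance of each path to ambient, K/W
    c_thermal = 0.05  # assumed heat capacity of each path, J/K
    # Path 0 starts 1% more conductive: the "unequal current flow" seed.
    g_cold = [1.01] + [1.0] * (n_paths - 1)
    temp_rise = [0.0] * n_paths
    share_history = []  # fraction of the total current carried by path 0
    for _ in range(steps):
        # Conductance of each path at its present temperature.
        g = [g0 * math.exp(alpha * t) for g0, t in zip(g_cold, temp_rise)]
        g_sum = sum(g)
        voltage = total_current / g_sum
        share_history.append(g[0] / g_sum)
        for i in range(n_paths):
            power = voltage ** 2 * g[i]  # heating in path i
            # Explicit Euler step of C dT/dt = P - T / R_thermal.
            temp_rise[i] += dt * (power - temp_rise[i] / r_thermal) / c_thermal
    return share_history, temp_rise

shares, temps = simulate_current_hogging()
print(f"path 0 share of current: start {shares[0]:.2f}, end {shares[-1]:.2f}")
```

With these assumed values the seeded path, which begins with roughly a quarter of the current, ends up carrying most of it, which is the filament formation the text describes. Lateral heat flow between paths is ignored, so the sketch says nothing about real filament size or timing.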

Electrical Overstress. If there is one term associated with mistakes in failure analysis, it is electrical overstress. When a device is said to have failed by electrical overstress, it is understood to mean that electrical overstress is the cause of failure. That is, application of excessive bias, voltage, current, and/or duration caused the failure. There would be no failure without the damaging bias.

Electrical overstress would correctly apply to second breakdown caused by application of an excessive voltage.

On the other hand, the same second-breakdown damage can be caused by something other than electrical overstress. Suppose an IC transistor consists of two transistors in parallel. A defect such as an open via or metal line could disconnect half the device. The active half-transistor may then fail under normal operating conditions. This simple example is given only to make one point clear: By itself, the observation of damage is not sufficient to conclude electrical overstress is the cause. Electrical ruggedness of ICs depends on many, sometimes tenuous, paths to the ground plane. The IC’s capability to handle electrical stress can be degraded by variations in design and processing. Failure analysis has not done its job if the task is dismissed upon first observation of melted metal or a voltage arc.

Failure analysis is a special form of problem solving. Most, if not all, failure analysts know the common platitudes about problem solving: “Do not assume,” “Consider all the data,” “Don’t ignore pertinent information,” and “Don’t jump to conclusions.” Analysts do not argue with these things; they are rather obvious. Nevertheless, mistakes often involve ignoring one or more of these warnings.

Examples from ISTFA

Just a few of many examples from past ISTFA proceedings are highlighted here. A more in-depth discussion of mechanisms causing electrical-overstress-like damage is given by Peter Jacobs in a 2012 paper.[1]

First ISTFA Reference

Three of four impressive examples given by Mark Gores in a 2008 ISTFA paper are described here.[2]

Example 1: Power FETs. Each of the case histories involves a different power transistor. The transistors have multiple source wires and a single gate wire. In two cases, the central area of the chips had been obviously and severely damaged. Metal and silicon appeared to have melted and alloyed into a black mass. These transistors were shorted drain-to-source. It is my understanding that a prior analysis had concluded the cause of failure to be electrical overstress (EOS). That conclusion was premature and completely incorrect. How could such a mistake be made? As Gores suggests, quick conclusions are encouraged by time pressure. Figure 1 shows undeniable and obvious evidence of excessive current and power. On the other hand, in Gores’ case 1, a conclusion of EOS would have to dismiss a demonstrated successful history of the device in the application. (Use all the data. Do not ignore data that don’t fit.) Acknowledging that EOS does not explain the known history led to further analysis focused on the gate bonds. Figure 2 shows an open bond at the frame due to mechanical damage. In the application, only one FET was to be “on” (low resistance) while a second FET was “off” (blocking). The open gate bond resulted in two transistors being “on” at the same time. Corrective action was to modify the lead bending operation. No change in bias was required.

Fig. 1 A power FET used in a bridge circuit failed short source-to-drain. This photo shows massive damage merging the source metal and substrate drain. The small wire at left is a gate. Courtesy of M. Gores, Hi-Rel Laboratories

Fig. 2 Cross section of a gate bond to frame. The aluminum wire is fractured away from the copper frame. Courtesy of M. Gores, Hi-Rel Laboratories

Example 2. In Gores’ case 3, a single photo (Fig. 3) shows spectacular electrical damage as well as a clearly lifted bond. Unless it is assumed that EOS caused the bond to lift, EOS must be rejected as a possible cause. The lifted bond must be explained and cannot be ignored. The unexplained symptom should be recognized as a big red flag. Further analysis showed that each case failed due to intermittently open gate contacts. In the application, only one FET was to be “on” (low resistance) while a second FET was “off” (blocking). The open gate bond resulted in two transistors being “on” at the same time. With the root cause of the problem identified as a bonding issue, steps were taken to improve bonding, which eliminated further failures.
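The rule applied in both examples, that a proposed cause must account for every observation, can be written down as a small check. The sketch below is illustrative only: the observation labels paraphrase the two cases, and the sets recording which observations each candidate cause explains are assumptions made for the example, not data from Gores’ paper. The helper name `unexplained` is hypothetical.

```python
def unexplained(observations, explained_by_cause):
    """Return the observations a candidate cause does not account for, in order."""
    return [obs for obs in observations if obs not in explained_by_cause]

# Observation labels paraphrased from the examples above (illustrative).
observations = [
    "drain-to-source short",
    "melted metal and silicon",
    "lifted gate bond",
    "successful prior history in the application",
]

# Assumed mapping of candidate cause -> observations it would explain.
candidates = {
    "electrical overstress": {"drain-to-source short", "melted metal and silicon"},
    "open gate bond": set(observations),
}

for cause, explains in candidates.items():
    gaps = unexplained(observations, explains)
    verdict = "consistent with all data" if not gaps else f"red flags: {gaps}"
    print(f"{cause}: {verdict}")
```

A cause that leaves any observation in the returned list is not yet a conclusion; it is a hypothesis that still has to be checked.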

Example 3. The last FET example failed electrically open. Decapping revealed that all seven source bonds were melted open, as shown in Fig. 4. The FET chip itself was not damaged. Obviously, the bond wires had seen much more current than they could handle. Electrical overstress is the obvious first thought, but that conclusion does not fit.

As stated earlier, EOS is not a failure mechanism; EOS is a category of causes. Mechanisms associated with EOS include, but are not limited to, dielectric breakdown, I²R heating, second breakdown, and latchup.

In this case, if the cause were EOS, what is the failure mechanism? Would application of excessive current to the device produce the observed damage? (“Don’t jump to conclusions” does not mean don’t consider the obvious. It simply means, don’t accept a conclusion until it is checked out.) The photographic evidence at hand is not consistent with the application of excess current across all source wires. In fact, the details are not even close.

A bonding wire subjected to excess current will overheat near the center. The ends of the wire will be cooler because the silicon or the package frame acts as a heat sink. The wire will melt in the span at a point dependent on the relative efficiency of the heat sinks.[3]
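Why the hot spot sits in the span, and why it shifts with the heat sinks, follows from a textbook steady-state model. The sketch below assumes a uniform wire with uniform Joule heating q (W/m³), constant thermal conductivity k, no heat loss from the wire surface, and fixed temperatures at the chip end (x = 0) and frame end (x = L). Under those assumptions T(x) = T_chip + (T_frame - T_chip)·x/L + (q/2k)·x·(L - x). The function names and all numeric values are illustrative assumptions; this is not the model used in reference 3.

```python
def wire_temperature(x, length, t_chip, t_frame, q, k):
    """Steady-state temperature at position x along a uniformly heated wire."""
    return t_chip + (t_frame - t_chip) * x / length + q / (2 * k) * x * (length - x)

def hottest_point(length, t_chip, t_frame, q, k):
    """Position of the temperature maximum, from dT/dx = 0, clamped to the wire."""
    x = length / 2 + k * (t_frame - t_chip) / (q * length)
    return min(max(x, 0.0), length)

# Assumed values: 2 mm wire, k near that of aluminum, arbitrary heating density.
LENGTH = 2e-3   # m
K_AL = 237.0    # W/(m*K)
Q = 5e10        # W/m^3, assumed

# Equal heat sinking at both ends: hottest point is mid-span.
print(hottest_point(LENGTH, 100.0, 100.0, Q, K_AL))
# Frame end runs hotter (poorer heat sink): hottest point moves toward the frame.
print(hottest_point(LENGTH, 100.0, 150.0, Q, K_AL))
```

In this simplified picture, a wire that shows melting right at a bond rather than in the span points to something other than uniform excess current along the wire.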

In this example, the source wires show different degrees of heating. One wire (Fig. 5) shows no signs of heating in the span, but the wire bond at the frame appears to have partially lifted and the remaining metal melted and solidified into shiny spheres. Other wires showed a significant length of clearly overheated wire near the frame bonds. These observations suggest one partially lifted bond with signs of arcing was the first to open. The remaining bonds, now carrying excessive current, overheated.

Fig. 3 This damaged FET shows severe damage source-to-drain on the right side. Just to the left of the damage is the impression of a lifted gate bond. Courtesy of M. Gores, Hi-Rel Laboratories

Fig. 4 This FET transistor failed open. Decapping revealed that all source wire bonds at the top of the photo were melted open. Courtesy of M. Gores, Hi-Rel Laboratories

The high-resistance bond overheated, melted, and arced, becoming open. The remaining three wires overheated under their increased load. The last wire fused open violently as it carried all the current. Everything fits. The location of overheating close to frame bonds suggests that all bonds may have poor heat conduction. Tests were introduced to measure the resistance of individual bonds and screen out devices with even one high-resistance bond. The cause of the failure was not EOS, and the simple corrective action could not happen until it was clear that poor bonding, not EOS, was the issue.
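The cascade is easy to put in numbers. The sketch below assumes the total source current stays constant and divides equally among the wires that are still intact, and it treats every wire as the same resistance. The wire count is a parameter; four is used here purely for illustration, and the function name `cascade` is hypothetical.

```python
def cascade(total_current, n_wires, wire_resistance=1.0):
    """List (wires intact, current per wire, I^2*R heating per wire) as wires open."""
    rows = []
    for intact in range(n_wires, 0, -1):
        current = total_current / intact        # equal sharing assumed
        heating = current ** 2 * wire_resistance
        rows.append((intact, current, heating))
    return rows

# Normalized so that each of four wires carries 1 A when all are intact.
for intact, current, heating in cascade(total_current=4.0, n_wires=4):
    print(f"{intact} intact: {current:.2f} A per wire, relative heating {heating:.2f}")
```

Because heating goes as the square of the current, each wire that opens raises the load on the rest faster than linearly, and the last wire carries the full current by itself. That matches the violent fusing of the final wire described above.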

Second ISTFA Reference: PCB Edge Connectors

There are several well-known mechanisms for intermittent printed circuit board (PCB) contacts. Debris can become wedged between an edge contact and the mating connector contact. Gold can be scratched, exposing underlying nickel, which then oxidizes. Although these mechanisms have been well known for decades, failures still occur. Faced with significant PCB intermittent contact failures, Munukutla et al.[4] describe a radical deviation from standard FA protocol implemented to understand the mechanism of failure and allow fast and sure corrective action. Their paper, “Damage-Induced Field Failures of Electrical Contacts,” was presented at ISTFA 2009.

The product was a large motherboard containing 16 tightly packed memory modules. Each module was essentially a PCB. Because of packing density, the modules were inserted at the factory. The survival of this product was seriously threatened by intermittent PCB contacts.

Intermittent PCB contact failures were observed upon introduction and early testing of the product (Fig. 6). Failures were fragile. Reseating modules corrected the faults.

Units returned for analysis often worked normally with no problem. Without verifying the specific failure modes, physical analysis of the boards seemed to confirm all the suspected causes. While the results were correct, they did not provide information to help improve the situation.

The breakthrough came when several systems shipped to a nearby customer experienced failures. Analysts moved out of the lab and traveled to the failure site, where tests could be made in situ. Importantly, failing modules could be fixed in place with epoxy. This step of fixing the position of a module with reference to a connector allowed a failing module and socket to be cross sectioned simultaneously. A cross section through an actively failing contact provided solid evidence that the failure was due to nickel oxide. Further, knowledge of specific failure sites led to identification of PCB design factors leading to deep scratches in the gold. See the paper for details, but for this discussion, note that the key to identifying the cause of failure was the adoption of extraordinary steps to capture information from the original failure. Without that information, it is doubtful PCB design faults would be linked to the failure problem. Knowledge of generic failure mechanisms would be correct but not helpful.

These examples illustrate the danger of underestimating initial data and the value of defining the problem correctly.

Fig. 5 One of the open source wires showed no heating evidence in the span, but part of the bond is lifted. Very localized melting suggests arcing. This high-resistance bond opened first, increasing current in the remaining wires. Courtesy of M. Gores, Hi-Rel Laboratories

Fig. 6 Turn-on failures were caused by intermittent contacts in the 16 memory modules mounted on the lower right. Courtesy of A. Munukutla, Intel Corp.

Failure Analysis Sequence

It is difficult to implement the recommended failure analysis sequence given below. Hopefully, the examples given here illustrate the importance of steps 1 and 2. Steps 1 and 2, if they are done at all, are not done as part of the failure analysis. Clearly, omission of these defining steps can be critical, or, said another way, failure analysis beginning at step 3 can be futile.

1. Confirm (verify) that a failure has occurred. A surprisingly large number of devices reported as failures are in fact good. The reasons for misidentification are numerous, ranging from faulty test procedures to misdiagnosis during troubleshooting at the PCB level. Identification of good devices wrongly labeled as failures is necessary to focus efforts on the real problem. Failure verification serves that purpose.

In practice, this step is often limited to troubleshooting at the PCB level. The device is labeled as the “failure” and passed on for subsequent analysis. Most relevant facts about the circumstance of failure are irretrievably lost.

2. Review the device history. The value of this step differs for individual cases, but it can be critical, as both examples show. In practice, devices are submitted for failure analysis without any history and without reference to application. Subsequent electrical and physical analysis is still valuable, but the best evidence of cause is often lost in the missing initial data.

3. Characterize the failure mode. Great progress has been made in tools to electrically characterize and isolate faults.

4. Isolate the physical fault. Great progress has been made in tools to identify and isolate physical faults.

5. Identify the failure mechanism. This is a tough step that, understandably, is often omitted. When it is omitted, the final result is only a precise identification of the electrical fault and the location of the physical damage.

6. Recommend a corrective action. Unless the root cause has been identified, the recommendation can only address possible actions to benefit subsequent analysis.

7. Document the failure analysis. Analysis reports and database records of the result are valuable company assets and important resources for subsequent analyses. The file of failure analysis results becomes valuable history for recurring failures.
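The seven steps above can be kept as a simple checklist in an analysis record so that a job starting at step 3 is flagged rather than silently accepted. The sketch below is one possible way to do that; the step names are taken from the list, while the function name `skipped_steps` and the idea of recording completed step numbers as a set are assumptions for illustration.

```python
FA_SEQUENCE = [
    "Confirm (verify) that a failure has occurred",
    "Review the device history",
    "Characterize the failure mode",
    "Isolate the physical fault",
    "Identify the failure mechanism",
    "Recommend a corrective action",
    "Document the failure analysis",
]

def skipped_steps(completed):
    """Return (number, name) for every step skipped before the latest completed one."""
    if not completed:
        return []
    last = max(completed)
    return [
        (number, FA_SEQUENCE[number - 1])
        for number in range(1, last + 1)
        if number not in completed
    ]

# An analysis that began with electrical characterization and fault isolation.
for number, name in skipped_steps({3, 4}):
    print(f"Step {number} was skipped: {name}")
```

For the example shown, the check reports steps 1 and 2, the two defining steps the article identifies as most often missing.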


About the Author

David Burgess is a failure analyst and reliability engineer. He developed techniques and taught in those areas at Fairchild Semiconductor and Hewlett-Packard. He is the founder of Accelerated Analysis, a manufacturer and distributor of specialty failure analysis tools. David is the co-author of Wafer Failure Analysis for Yield Enhancement. A graduate of Rensselaer Polytechnic Institute and San Jose State University, he is a member of EDFAS and has served on various ISTFA committees. David is a Senior Life Member of IEEE and was General Chairman of the 1983 International Reliability Physics Symposium (IRPS).


Root-Cause Analysis

Failure analysis, as defined previously, includes root-cause analysis. Nevertheless, perhaps because typical failure analysis does not lead to cause, it is suggested that a root-cause analysis (RCA) team be formed to focus on critical failures. The RCA team collectively performs the failure analysis process. In fact, it may be a stretch to expect a lone analyst to take on the roles of all RCA team members. An RCA team may include:

• One lead analyst responsible for its results

• An associate analyst to handle administrative duties and keep records

• One or more experts to help interpret data and generate theories consistent with the data

• A critic who will look for and point out holes in the logic supporting any proposed conclusion. The critic will also force investigation of alternative possibilities.

Yield Analysis

It is not off point to mention that many of the best ISTFA papers concern wafer yield analysis. The same failure analysis tools are used. Often an RCA-like team analyzes wafer yield loss. Identifying cause is the only goal. Unlike FA, device history and associated process data are always available and fully utilized.

References

1. P. Jacobs: “EOS (Electrical Overstress)—The Old, Unknown Phenomena?” Int. Symp. Test. Fail. Anal. (ISTFA), 2012, pp. 156-63.

2. M. Gores: “Mis-Identified Failures in FETs,” Int. Symp. Test. Fail. Anal. (ISTFA), 2008, pp. 481-84.

3. R. King, C. Van Schaick, and J. Lusk: “Electrical Overstress of Nonencapsulated Aluminum Bond Wires,” Int. Reliab. Phys. Symp. (IRPS), 1989, pp. 141-51.

4. A. Munukutla, R. Rahn, and J. Lewis: “Damage-Induced Field Failures of Electrical Contacts,” Int. Symp. Test. Fail. Anal. (ISTFA), 2009, pp. 347-51.

Selected References

• M. Horev: Root Cause Analysis in Process-Based Industries, Trafford Publishing, 2008.

• C. Kepner and B. Tregoe: The Rational Manager: A Systematic Approach to Problem Solving and Decision-Making, McGraw-Hill Book Company, New York, 1965.

• R. Latino and K. Latino: Root Cause Analysis: Improving Performance for Bottom Line Results, CRC Press, 1999.
