
On the role of statistical significance in exploratory data analysis

Inderpal Bhandari and Shriram Biyani
IBM T. J. Watson Research Center
PO Box 704, Yorktown Heights, NY 10598, USA
e-mail: isb@watson.ibm.com

Abstract

Recently, an approach to knowledge discovery, called Attribute Focusing, has been used by software development teams to discover such knowledge in categorical defect data as allows them to improve their process of software development in real time. This feedback is provided by computing the difference of the observed proportion within a selected category from an expected proportion for that category, and then, by studying those differences to identify their causes in the context of the process and product. In this paper, we consider the possibility that some differences may simply have occurred by chance, i.e., as a consequence of some random effect in the process generating the data. We develop an approach based on statistical significance to identify such differences. Preliminary, empirical results are presented which indicate that knowledge of statistical significance should be used carefully when selecting differences to be studied to identify causes. Conventional wisdom would suggest that all differences that lie beyond some small level of statistical significance be eliminated from consideration. Our results show that such elimination is not a good idea. They also show that information on statistical significance can be useful in the process of identifying a cause.

Keywords: data exploration, knowledge discovery, statistical significance, software engineering.

1 Introduction

Knowledge discovery in databases [1] often involves the identification of statistical relationships in data as a first step towards the discovery of knowledge. The main problem with such identification is that a large data set usually contains very many statistical relationships that are quite useless in the context of knowledge discovery. For instance, some of those relationships will be well known, and hence, not be meaningful in the context of knowledge discovery. Those relationships can be screened for and removed by incorporating domain knowledge in the knowledge discovery system [2]. Other relationships will be purely statistical, i.e., relationships that occur by chance because of random effects in the underlying processes used to generate and collect the data. Such chance relationships are undesirable as they may lead to the discovery of spurious knowledge. Those relationships can be screened for and removed by using methods of statistical validation. The identification of relationships that occur by chance has always been a concern in statistical analysis, specifically, in the arena of hypothesis testing ([3], Chapter 6). The approach used to eliminate such relationships is based on determining the extent to which an observed relationship could have been produced by chance. One uses that extent to decide whether to accept or to reject that relationship. Clearly, a similar strategy can be used in knowledge discovery to remove chance relationships, and it often is recommended in the literature [2, 4, 5].

This paper suggests that such a strategy be used carefully, since the circumstances under which hypothesis testing is done in the context of knowledge discovery may be different from conventional applications of such testing. Hence, it may not be advisable to follow the dictates of conventional wisdom. In particular, we study the use of statistical significance in the context of the application of an approach to knowledge discovery called Attribute Focusing to software engineering. By utilizing data from a fielded application, it is shown that the use of statistical significance can eliminate useful relationships, i.e., those that are in fact representative of new knowledge. Since extensive field work on applying attribute focusing to the software engineering domain has shown that the number of patterns that actually lead to discovery is small - an average of two to three patterns per set of data - such elimination is not desirable. Thus, our results show that one cannot simply make use of statistical significance as is conventionally accepted; instead, one must understand how it should and should not be incorporated in attribute focusing.

2 Background on software process improvement

We begin by providing the necessary background on software engineering. The software production process [6, 7] provides the framework for development of software products. Deficiencies in the activities that make up the process lead to poor quality products and long production cycles. Hence, it is important to identify problems with the definition or execution of those activities.




Software production results in defects. Examples of defects are program bugs, errors in design documents, etc. One may view defects as manifestations of deficiencies in the process activities; hence, it is useful to learn from defect data. Such learning is referred to as defect-based process improvement. If it occurs during the course of development, it is referred to as in-process improvement. Else, if it occurs after development is completed, it is referred to as post-process improvement. In many production laboratories, software defects are classified by the project team in-process to produce attribute-valued data based on defects [8, 9, 10]. Those data can be used for defect-based process improvement.

Recently, Bhandari, in [11], introduced a method for exploratory data analysis called Attribute Focusing. That method provides a systematic, low-cost way for a person such as a software developer or tester, who is not an expert at data analysis, to explore attribute-valued data. A software development project team can use Attribute Focusing to explore their classified defect data and identify and correct process problems in real time. There have been other efforts in the past that focused on data exploration and learning from data, but they do not focus on providing real-time feedback to a project team. We mention them for the sake of completeness. Noteworthy examples are the work by Selby and Porter [12] and the work on the Optimized Set Reduction technique by Briand and others [13].

The application of attribute focusing to process improvement was fielded in mid-1991. Major software projects within IBM started using the method in actual production of software. By year-end 1992, there were over 15 projects that were using the method. The extensive field experience has allowed us to document evidence [14, 15, 16, 17] that Attribute Focusing (AF) has benefitted many different development projects within IBM over and beyond other current practices for both in-process and post-process improvement. In [14], Bhandari and Roth reported on the experience of using AF to explore data collected from a survey based on system test and early field defects. In [15], Chaar, Halliday, et al. utilized AF to explore defect data based on the four attributes in the Orthogonal Defect Classification scheme [10] and other commonly used attributes to assess the effectiveness of inspection and testing. In [16], the experience of using AF to explore such data to make in-process improvements to the process of a single project was described, while in [17], such improvement was shown to have occurred for a variety of projects that covered the range of software production from mainframe computing to personal computing.

We will make use of the above-mentioned field experience later on. For the moment, we seek a high-level understanding of the main topic of this paper. Attribute Focusing makes use of functions to highlight parts of the categorical data. Those functions have the form:

    I(x) = O(x) - E(x)    (1)

where O(x) represents what was observed in the data for category x while E(x) represents what was expected for that category. Equation 1 computes a difference between two distributions. In the case of Attribute Focusing, such differences are used to call the attention of the project team to the relationship O(x). The team determines whether those relationships are indicative of hidden process problems, and if so, determines how those problems must be resolved.

In this paper, we consider the possibility that some differences may simply have occurred by chance, i.e., as a consequence of some random effect in the software development process. There are many possible sources in a complicated process such as software development which could contribute to such an effect. A simple example is the process of classifying data itself. A person could misclassify defects by simple oversight, which could be modeled mathematically by a random process.

After presenting background information (Section 3), we discuss the effect of such chance-based differences on Attribute Focusing (Section 4). We develop an approach based on statistical significance to determine the degree to which a difference could have occurred by chance (Section 5). Preliminary, empirical results are presented and used to illustrate how that approach should and should not be incorporated in Attribute Focusing (Section 6). We discuss the general implications of our results, and, in conclusion, summarize the limitations and lessons of this paper (Section 7).

3 Attribute Focusing

To make this paper self-contained, we present background information on Attribute Focusing and how it is used to explore classified defect data. The method is used on a table of attribute-valued data. Let us describe how such a table is created from defect data.

Recall (Section 2) that defects are classified by the project team. The classification of defects results in a table of attribute-valued data. Let us illustrate what such a table looks like. We begin by reviewing some attributes which are being used at IBM to classify defects. Those attributes are the same as, or modifications of, attributes that have been reported in the past in the considerable literature which exists on defect classification schemes.

The attributes missing/incorrect and type capture information on what had to be done to fix the defect after it was found. Missing/incorrect has two possible values, namely, missing and incorrect. For instance, a defect is classified missing if it had to be fixed by adding something new to a design document, and classified incorrect if it could be fixed by making an in-place correction in the document. Type has eight possible values. One of those values is function. The type of a defect is chosen to be function if it had to be fixed by correcting major product functionality.



The attribute trigger captures information about the specific inspection focus or test strategy which allowed the defect to surface. One of its possible values is backward compatibility. The trigger of a defect found by thinking about the compatibility of the current release with previous releases is chosen to be backward compatibility. The attribute component is used to indicate the software component of the product in which the defect was found. Its set of possible values is simply the set of names of the components which make up the product.

The software development process consists of a sequence of major activities such as design, code, test, etc. After every major activity, the set of defects that are generated as a result of that activity are first classified, and then explored by the project team using Attribute Focusing. A set of classified defects may be represented by an attribute-valued table. Every row represents a defect and every column represents an attribute of the classification scheme. The value at a particular location in the table is simply the value chosen for the corresponding attribute to classify the corresponding defect for that location. Table 1 illustrates a single row of such a table. The names of the attributes are in upper case and the values of those attributes chosen for the defect are in lower case. The single row in the table represents a defect which had to be fixed by introducing new product functionality, was found in a component called India, and was detected by considering the compatibility of the product with previous releases.

Once an attribute-valued table has been created from the defect data, a project team can use Attribute Focusing to explore the data to identify problems with their production process. That exploration has two steps. The first step utilizes automatic procedures to select patterns in the data which are called to the attention of the team. Next, the patterns are studied by the project team to identify process problems. Let us review those steps in turn.

3.1 Computing Differences

Patterns in the data are called to the attention of the project team based on the differences between observed statistics and theoretical distributions. Those differences are defined by two functions, I1 and I2, which are discussed in turn below.

    I1(X = a) = p(X = a) - 1/Choice(X)    (2)

where X is an attribute, a is a possible value for that attribute, p(X = a) is the proportion of rows in the data for which X = a, and Choice(X) is the total number of possible values that one may choose for X. I1 is used to produce an ordered list of all attribute-values. Table 2 shows a part of such an ordering.

The table shows the observed and expected proportions and their difference for two attribute-values as computed by I1. The column attribute-value corresponds to X = a in Equation 2, while Observed corresponds to p(X = a), Expected corresponds to 1/Choice(X), and Difference corresponds to I1(X = a). Thus, we see that 27% of the defects were classified function and 44% of the defects were classified missing. The expected proportions are computed based on what the proportions of the attribute-values would be if their respective attributes were uniformly distributed. For instance, function is a value for the attribute type, which has eight possible values. Hence, its expected proportion is 12.5%. Missing is a value of the attribute Missing/Incorrect, which has two possible values. Therefore, the expected proportion is 50%. The Difference column in the table simply computes the difference between the Observed and Expected columns for every row in the table. The table is ordered by the absolute value of that difference. A similar table is computed based on I2 (Equation 3) as discussed below.

    I2(X = a, Y = b) = p(X = a, Y = b) - p(X = a) * p(Y = b)    (3)

where p(X = a) is the proportion of defects in the data for which X = a, while p(X = a, Y = b) is the proportion of records in the data for which X = a and Y = b. I2 is used to produce an ordered list of all possible pairs of attribute-values that can be chosen for a defect. Table 3 illustrates a part of such an ordering.

The columns value1 and value2 denote attribute-values. For brevity, we have omitted the attributes themselves and only specified the values of interest. Value1 and value2 together define the category of defects represented by a row. For instance, the first row has information about those defects for which the attribute type was classified function and the attribute missing/incorrect was classified missing. The second row has information about those defects for which the attribute component was classified India and the attribute trigger was classified backward compatibility.

Let us understand the information in the first row. obs1 specifies that 27% of the defects were classified function. obs2 specifies that 44% of the defects were classified missing. obs12 indicates that 15% of the defects were classified both function and missing. expec12 is simply the product of obs1 and obs2, as per the I2 function. Thus the expected value is computed based on what the proportion of defects that were classified using two attribute-values would be if the selection of those attribute-values were statistically independent. diff is the difference of obs12 and expec12. The table is ordered by the absolute value of that difference.
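To make the mechanics concrete, here is a minimal sketch of how the I1 and I2 orderings could be computed from classified defect records. It is our own illustration, not the fielded AF implementation; the record layout and the function names are assumptions.

    from itertools import combinations

    # Each classified defect is a record mapping attribute -> chosen value;
    # DOMAINS gives Choice(X), the number of values a classifier may pick.
    DOMAINS = {"missing/incorrect": 2, "type": 8}

    defects = [
        {"missing/incorrect": "missing", "type": "function"},
        {"missing/incorrect": "incorrect", "type": "assignment"},
        # ... one record per classified defect
    ]

    def i1_table(defects, domains):
        # I1(X = a) = p(X = a) - 1/Choice(X), per Equation 2.
        n = len(defects)
        rows = []
        for attr, choice in domains.items():
            for value in {d[attr] for d in defects}:
                observed = sum(d[attr] == value for d in defects) / n
                rows.append((attr, value, observed, 1 / choice, observed - 1 / choice))
        # Order by the absolute value of the difference, as in Table 2.
        return sorted(rows, key=lambda r: abs(r[4]), reverse=True)

    def i2_table(defects, domains):
        # I2(X = a, Y = b) = p(X = a, Y = b) - p(X = a) p(Y = b), per Equation 3.
        n = len(defects)
        rows = []
        for x, y in combinations(domains, 2):
            for a in {d[x] for d in defects}:
                p_a = sum(d[x] == a for d in defects) / n
                for b in {d[y] for d in defects}:
                    p_b = sum(d[y] == b for d in defects) / n
                    obs12 = sum(d[x] == a and d[y] == b for d in defects) / n
                    rows.append((a, b, p_a, p_b, obs12, p_a * p_b, obs12 - p_a * p_b))
        # Order by the absolute value of the difference, as in Table 3.
        return sorted(rows, key=lambda r: abs(r[6]), reverse=True)

The top rows of the two orderings, cut off as discussed below, correspond to Tables 2 and 3.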

Equations 2 and 3 can both be cast in the form of Equation 1. Note that in both cases the expected distributions are based on well-known theoretical concepts. I1 is based on the concept of the uniform distribution, while I2 is based on statistically independent distributions. To appreciate the exploratory nature of I1 and I2, note that the choice of those expected distributions is not based on experience with software production, but is based instead on simple information-theoretic arguments. An attribute will convey maximum information if we expect it to have a uniform distribution, while two attributes will convey maximum information if their expected distributions are statistically independent. To get a quick appreciation of that argument, consider the converse case. If a single value is always chosen for an attribute, we may as well discard that attribute. It is useless from the point of view of presenting new information. Similarly, if two attributes are perfectly related, so that we can predict the value of one knowing the value of the other, we may as well discard one of those attributes.

TABLE 1

MISSING/INCORRECT   TYPE       TRIGGER               COMPONENT
missing             function   back. compatibility   India

TABLE 2

attribute-value                Observed   Expected   Difference
Type = function                27%        12.5%      14.5%
Missing/Incorrect = missing    44%        50%        -6%

Continuing with the description of Attribute Focusing, I1 and I2 produce orderings of attribute-values and of pairs of attribute-values, respectively. The top few rows of those orderings are presented to the project team for interpretation. The selection of those items is based on the amount of time the project team is willing to spend interpreting the data. That time is usually around 2 hours for a set of data that has been generated. As we shall see, the team uses a specified model to interpret the data that is presented to them. Hence, we can calibrate the time taken to interpret items in the data and determine how many rows should be presented to the team for each ordering.

For the purpose of this paper, we omit the details of calibration and simplify the details of presentation and interpretation. We will assume that there are two tables presented to the team, one table based on I1 and one table based on I2, that both tables have been appropriately cut off as explained above, and that the team interprets a table one row at a time. In actuality, the team interprets a collection of smaller tables which, when combined, would result in the larger tables based on I1 or I2. The team will also often consider the relationships between items in different rows that are presented to them. All those details are adequately covered in [11, 14, 16, 17]. For the purpose of this paper it is the total number of items which are interpreted that is important, and not the form those items are presented in, or the fashion in which they are interpreted. Hence, we use the simplified description above.

3.2 Interpretation of data

The team interprets the information about the attribute-values in each table, one row at a time. A row in the tables based on I1 or I2 presents information about the magnitude or association of attribute-values, respectively. We will refer to that information as an item. The team relates every item to their product and process by determining the cause and implication of that magnitude or association, and by corroborating their findings. The implication is determined by considering whether that magnitude or association signifies an undesirable trend, what damage such a trend may have done up to the current time, and what effect it may have in the future if it is left uncorrected. The cause is determined by considering what physical events in the domain could have caused that magnitude or association to occur. The cause and implication are addressed simultaneously, and are corroborated by finding information about the project that is not present in the classified data but which confirms that the cause exists and the implication is accurate. Details of the interpretation process used by the team may be found in [17, 16]. We will not replicate those details here but will illustrate the process of interpretation by using the items in Table 3. Those items are taken from a project experience that was the subject of [16].

TABLE 3

value1     value2          obs1   obs2   obs12   expec12   diff
function   missing         27%    44%    15%     12%       3%
India      back. compat.   39%    10%

Example 1: The first item in Table 3 shows an association which indicates that defects which were classified function also tended to be classified missing. Recall that function defects have to be fixed by making a major change to functionality, and missing defects have to be fixed by adding new material as opposed to an in-place correction. The implication is that the fixes will be complicated, quite possibly introducing new defects as new material is added. Hence, this is an undesirable trend which should be corrected. The team also determined that the cause of the trend was that the design of the recovery function, the ability of the system to recover from an erroneous state, was incomplete. The textual descriptions of the defects that were classified function and missing were studied in detail. It was found that all such defects pertained to the recovery function, hence corroborating the cause.

Example 2: The second item in Table 3 shows a disassociation which indicates that defects which were classified India did not tend to be classified backward compatibility. The team determined that the cause of the trend was that issues pertaining to compatibility with previous releases had not been addressed adequately in the design of component India. In other words, since they had missed out on an important aspect of the design of the component, there were few defects to be found that pertained to that aspect. The implication was that, if the design was not corrected, existing customer applications would fail after the product was shipped. The existence of the problem was corroborated by noting the lack of compatibility features in the design of component India, the existence of the compatibility requirement for the product, and the consensus amongst the designers in the team that component India should address that requirement.

The project team goes through the tables item by item in the above fashion. Items for which the implication is undesirable and the cause is determined lead to corrective actions, which are usually implemented before proceeding to the next phase of development. Items for which the implication is undesirable but the cause cannot be determined in the interpretation session are investigated further at a later time by examining the detailed descriptions of the relevant defects (that were classified using the attribute-values in the item under investigation). If the cause is still not found after such examination, the items are not considered any further. We will say that such items have been dismissed.

4 The effects of chance

Let us consider what would happen if an item in the tables presented to the team had occurred purely by chance. It is certainly possible that the item may have an undesirable implication, since the implication is often determined by considering the meanings of the attribute-values in the item. In Example 1 (Section 3.2), the meanings of the words function and missing were used to conclude that the trend was undesirable. Let us examine what would happen if the relevant association for that example, the first item in Table 3, was the result of a chance effect and did not have a physical cause.

Since the implication was undesirable, the team would investigate that trend. Therefore, an obvious adverse effect of the chance occurrence is that it would waste the time of the team.

What would be the result of that investigation? There are two possibilities: a cause would be found, or a cause would not be found. We discuss each possibility in turn to determine how the process of interpretation could be improved to address chance effects.

Let us examine how a cause could be found for such an item. If a cause is identified, it implies that a mistake was made by the team, since we have assumed that the item occurred by chance and had no physical cause. Such a mistaken cause could not be corroborated unless a mistake was made during the corroboration as well. Finally, both those mistakes, in cause identification and cause corroboration, have to be subtle enough to fool all the members of the team who were interpreting the item. Under that circumstance a cause could be identified mistakenly, with possibly adverse consequences for the project. Hence, we conclude that the identification of a cause by mistake is unlikely but not impossible. To further reduce the chance of such a mistake occurring, it would be good to identify items in the tables that were a result of chance effects and eliminate them to avoid such mistakes.

If, correctly, no cause was found for the item, the item would be dismissed from further consideration. But it would be more effective if one knew to what extent the item being dismissed was a chance occurrence. If the likelihood was high, one could indeed dismiss the item. Else, one could devote more effort to finding the cause.

In summary, there are three adverse effects that can occur on account of chance effects. The team may waste time considering items that occurred by chance, the team may find a cause by mistake for such an item, and, finally, the team may dismiss an item that should be pursued further. We develop an approach based on statistical significance below to address those concerns.

5 Statistical significance

It is hard to characterize the effects that occurred by chance when the data are generated by a complex physical process such as software development. Indeed, if it were possible to identify all variations in that process, it would imply that we have a very good understanding of how software should be developed. The truth is we do not, and we are a long way from such an understanding. Hence, the approach we use to address chance occurrences is based on the well-known concept of statistical significance (see Chapter 6 of [3]). To develop such an approach we do not require a sound understanding of software development; instead, we use the following device. We define a random process which can generate classified defect data. Then, we determine how easily that process could replicate an item that was observed in the actual defect data. If an effect is easily replicated, we have shown that it can easily occur by chance. Else, the effect is unusual, since it is not easy to replicate using the random process. The ease with which the random process can replicate an item is referred to as the statistical significance of that item.
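The device can be made concrete with a small simulation. The sketch below is our own illustration (the counts are those used in the example of Section 5.1); it estimates how often a purely uniform random classifier replicates an observed count, which is the quantity the following subsections compute exactly.

    import random

    def replication_rate(n_defects, n_choices, observed_count, trials=100_000):
        # Classify n_defects uniformly at random over n_choices values and
        # estimate how often one value is chosen observed_count or more times.
        hits = 0
        for _ in range(trials):
            count = sum(1 for _ in range(n_defects) if random.randrange(n_choices) == 0)
            if count >= observed_count:
                hits += 1
        return hits / trials

    # 55 of 100 defects falling in one of 2 equally likely categories is
    # replicated roughly 18% of the time, so it is not very unusual.
    print(replication_rate(100, 2, 55))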

5.1 Statistical significance based on magnitude

The calculation of the statistical significance for the items produced by the function I1 can be performed as follows.

Recall that the proportion of data records with value a for attribute X is p(X = a) = N(X = a)/N(X), where N(X) is the number of cases for which the value of attribute X is defined, and N(X = a) is the number of cases with value a for attribute X. Under the hypothesis that all categories are equally likely, the expected value of this proportion, E(X, a), was 1/Choice(X) in Equation 2.

The classical notion of statistical significance is based on the tail probability beyond the actually observed value, under the theoretical distribution of a statistic. Let us explain what that means. Define the following random process to generate classified defect data: we assume that all defects are generated independently and that a defect is classified X = a with probability E(X, a) = 1/Choice(X). Let us determine how easy or difficult it would be for that random process to replicate p(X = a). This is easily done as follows. If p(X = a) is greater than or equal to E(X, a), we can determine the probability with which N(X = a) or more defects could be classified X = a. That probability would give us an idea of how easily the random process would produce an effect that was equal or greater in magnitude than p(X = a). If p(X = a) is less than E(X, a), we can determine the probability with which N(X = a) or fewer defects could be classified X = a. That probability would give us an idea of how easily the random process would produce an effect that was as small or smaller in magnitude than p(X = a).

This probability, commonly known as a P-value, is an indicator of the likelihood of an observed difference occurring by pure chance. If the P-value is small, such as .001, then the observed difference is hard to replicate using the random process. This can be considered as evidence that a true difference exists in the underlying software development process. If the P-value is large, such as .4, the difference can be easily replicated by the random process. This can be considered as evidence that the difference may not exist in the software development process and may have occurred by chance. The P-value is an inverse measure of 'unusualness': the smaller the P-value, the harder it is to believe that a deviation is a chance occurrence. The formal development is given below.

Under the random sampling assumption, the P-value for testing whether the observed proportion p(X = a) is consistent with the null hypothesis E(X, a) can be calculated as a tail probability under a binomial distribution. Note that this distribution is conditional on the total number of records N(X) for which the value of attribute X is defined. The parameters of the binomial distribution are: the number of trials, n = N(X), and the probability of 'success' at each trial, p = E(X, a).

We use either the upper or the lower tail of the binomial distribution, depending on whether the observed proportion p(X = a) is above or below the expected proportion E(X, a).

Let b(x; n, p) denote the binomial probability function. This is the probability of x successes in n trials, when the probability of success at each trial is p. It is given by the equation:

    b(x; n, p) = C(n, x) p^x (1 - p)^(n - x)    (4)

where C(n, x) denotes the number of ways of choosing x items out of n.

The relevant tail probability can be defined as:

    Is1(X, a) = sum_{i = 0 .. N(X = a)} b(i; n, p),    if p(X = a) <= E(X, a)
    Is1(X, a) = sum_{i = N(X = a) .. n} b(i; n, p),    if p(X = a) > E(X, a)    (5)

where n = N(X) and p = E(X, a).

In practice, the calculation of the above probabilities is best accomplished using readily available programs to compute the binomial cumulative distribution function. The computational cost is much less than the above formula would suggest, since fast and accurate algorithms are available for most cases of practical interest.

Example: Consider two attributes X and Y with 2 and 20 values, respectively. Assume that for each attribute, all values are equally likely, so that the expected proportion in each category is .5 for X and .05 for Y. Suppose that the actual data contains 100 records with values of X and Y defined, and 55 of these have X = a and 8 have Y = b, where a and b are two specific values of attributes X and Y.

I1 gives:

    I1(X = a) = .55 - .5 = .05, and
    I1(Y = b) = .08 - .05 = .03.

Now consider the statistical significance Is1 of the two attribute-values:

    Is1(X, a) = sum_{x = 55 .. 100} b(x; 100, 0.5) = 0.18
    Is1(Y, b) = sum_{x = 8 .. 100} b(x; 100, 0.05) = 0.13

Notice that in this example, I1 gives a higher value to the first deviation than to the second, but the statistical significance Is1 indicates that the latter is the more unusual occurrence, in the sense that it is less likely to occur by pure chance. This situation is the exception rather than the rule. In general, larger deviations will tend to have smaller P-values.
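As noted above, the tails are computed in practice from the binomial cumulative distribution function. A minimal sketch using scipy (our choice of package; any binomial CDF routine will do) reproduces the example:

    from scipy.stats import binom

    def i_s1(count, n, expected_p):
        # Is1 of Equation 5: lower tail if the observed proportion is at or
        # below the expected one, upper tail otherwise.
        if count / n <= expected_p:
            return binom.cdf(count, n, expected_p)    # P(Z <= count)
        return binom.sf(count - 1, n, expected_p)     # P(Z >= count)

    print(i_s1(55, 100, 0.5))   # ~0.18, Is1(X, a)
    print(i_s1(8, 100, 0.05))   # ~0.13, Is1(Y, b)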



5.2 Statistical significance based on association

A similar approach is used to derive P-values for items produced by I2. The formal development is given below.

For a given pair of attributes (X, Y), let N(X, Y) be the total number of cases for which both X and Y are defined. For a specific pair of values (a, b) of (X, Y), let N(X = a) and N(Y = b) be the number of cases with X = a and Y = b, respectively. Let p(X = a, Y = b) be the proportion of cases with both X = a and Y = b. The corresponding expected proportion under the hypothesis of statistical independence of the attributes X and Y (conditional on N(X = a) and N(Y = b)) is E_ab = p(X = a) p(Y = b). The function I2(X = a, Y = b) measures the difference between the actual and the expected proportions. We define the corresponding statistical significance as follows.

For short, let us write p_a = p(X = a), p_b = p(Y = b), and p_ab = p(X = a, Y = b), and let Z denote the number of cases classified both X = a and Y = b under the random process. Then,

    Is2(X, a, Y, b) = Prob(Z <= N(X = a, Y = b)),    if p_ab < p_a p_b
    Is2(X, a, Y, b) = Prob(Z >= N(X = a, Y = b)),    if p_ab > p_a p_b    (6)

Is2 measures the tail probability of a value as extreme as or more extreme than the actual observed count N(X = a, Y = b). We calculate this probability conditional on the marginal totals N(X = a), N(Y = b) and N(X, Y).

The exact calculation of Is2 is based on the hypergeometric distribution. The probability of a specific value z within the range of Z is:

    h(z; N, A, B) = C(A, z) C(N - A, B - z) / C(N, B)    (7)

where N = N(X, Y), A = N(X = a) and B = N(Y = b).

The significance level is calculated by summing h(z; N, A, B) over the left or right tail as indicated above. In practice, the computation can be done using the cumulative distribution function of the hypergeometric distribution, available in major statistical packages.

Example: Table 4 shows data on three pairs of attribute-values, taken from a data set with a total of 71 observations.

TABLE 4

value1   value2   obs1          obs2          obs12         expec             diff   P-value
a        b        10/71 (14%)   6/71 (8%)     5/71 (7%)     .14 x .08 (1%)    6%     0.01%
c        d        35/71 (49%)   17/71 (24%)   12/71 (17%)   .49 x .24 (12%)   5%     4.1%
c        e        35/71 (49%)   24/71 (34%)   14/71 (20%)   .49 x .34 (17%)   3%     20%

Notice that the second pair has a higher value of the function I2 than the first one, but the P-value (Is2) suggests that the first pair is the more unusual of the two. Also observe that the first and third pairs have nearly equal values of I2, but there is a big difference in their P-values.
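As with the magnitude case, a short sketch shows the computation; scipy's hypergeometric distribution is our assumption, and the counts are those of the second pair in Table 4:

    from scipy.stats import hypergeom

    def i_s2(n_ab, n_a, n_b, n_total):
        # Is2 of Equation 6, computed from the hypergeometric distribution
        # conditional on the marginal counts n_a, n_b and the total n_total.
        rv = hypergeom(n_total, n_a, n_b)   # population, tagged items, draws
        if n_ab / n_total < (n_a / n_total) * (n_b / n_total):
            return rv.cdf(n_ab)             # lower tail: disassociation
        return rv.sf(n_ab - 1)              # upper tail: association

    print(i_s2(12, 35, 17, 71))  # ~0.041, the P-value of the pair (c, d)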

6 Empirical results and lessons

On the surface, there seems to be an obvious way to incorporate the statistical significance approach in Attribute Focusing. As discussed in Section 5, smaller P-values suggest the more unusual items, while larger P-values suggest that the items could have occurred by chance. Usually, some small P-value such as 0.05 is used to decide which effects are statistically significant. Effects that have a P-value beyond that small value are considered statistically insignificant, while effects that have a smaller P-value are considered statistically significant.

Conventional wisdom would suggest that such a small P-value be used to remove the statistically insignificant items from the tables that are studied by the project team. Such removal would eliminate the concerns mentioned in Section 4, since items that were likely to be a result of chance occurrences would simply not be considered by the team. However, the empirical results below show that such removal is not a good idea.

Table 5 is based on the data from the project experience that was reported in detail in [16]. There were seven process problems that were identified in that experience as a result of Attribute Focusing. Those problems are labelled A through G in the table. The first column, Table, indicates the different tables that were interpreted by the team in five feedback sessions. The tables are identified as XS or XP, where X is a numeral indicating that the table was interpreted in the X'th feedback session. S indicates that the table involved single attribute-values, since it was based on I1, while P indicates pairs of attribute-values, since the table was based on I2. There is one table that is specified as 1+2P. That table was based on I2 applied to the combined data from both the first and second feedback sessions. Such combination is used for validating that the desired effect of corrective actions that were implemented as a result of earlier feedback sessions is reflected in the later data (see Section 2.2.3 of [16] for specific details, and Chapter 12 of [18] for general examples of validation). Sometimes, if there are items that go beyond such validation, the combined data is also explored by the team. Those tables have been included in Table 5.

The column "1.00" summarizes information about the project experience as reported in [16]. The number of items that were interpreted by the team in every table is provided, along with the process problems identified as a result of that interpretation. For instance, 1S had 23 items. The interpretation of those items led to the identification of problems A and B. 1P had 47 items, but their interpretation did not lead to the identification of any problems, and so on. The other columns in Table 5 show the effect on a table had items beyond the significance level specified at the head of the column been eliminated from the table. For instance, let us look at the column with the heading "0.05". We see that 1S would have been reduced to a table with only 6 items had we eliminated items that had a significance level greater than 0.05. The item corresponding to process problem B would have been retained after such reduction. The column Size specifies the sample size, i.e., the number of defects in the data that was used to create the tables. XS and XP were created from the same data; hence, only one size is provided, in the row corresponding to XP. The information in the last three rows of the table will be explained in Section 7.





The above table shows that it is possible for items that are statistically insignificant to be physically significant, i.e., they are not chance occurrences but instead have a definite physical cause. In other words, if items are eliminated from the tables based on some small P-value, it is possible that some of those items are the very items that would have led to the identification of process problems. For instance, conventional wisdom would suggest that all items that had a significance level of more than 0.05 should be eliminated from the tables. From column "0.05", we see that while such reduction would indeed reduce the size of the tables to be interpreted, it would eliminate four of the seven items that led to the identification of process problems. Only the entries corresponding to B, D and G would be retained. Hence, had we used such reduction, it would have been difficult for the project team to identify the other process problems.

There are fundamental reasons why items can be statistically insignificant but physically significant. First, note that the existence of such a possibility can be inferred from our approach itself. The information on statistical significance is extraneous to software development, i.e., we did not require any knowledge of software development to compute the P-values. Instead, we used knowledge of the behavior of a process that generated defects randomly. Therefore, if an item in a table has a low statistical significance, it tells us that we could replicate that item rather easily by using a random process. It does not tell us, necessarily, whether that effect is easy or difficult to replicate using the software development process. In other words, statistical significance does not necessarily suggest physical significance.

Let us acquire a deeper understanding by examining suitable examples from Table 5. We will make use of the process problems A, C, E and F, which would have been missed had we used the significance level of 0.05 to limit the number of entries to be presented to the team.

6.1 Hidden causes

Fundamentally speaking, the items corresponding to process problems A and F were statistically insignificant but physically significant for the same reason: the presence of a hidden cause. Let us illustrate that reason by using process problem F. The identification of process problem F occurred when the team considered the association between function and missing which is presented in the first row of Table 3. The statistical significance for that association (computed using Equation 6) was found to be 0.39. In other words, the random process model that was used to derive Equation 6 could have produced that association or a stronger association nearly 40% of the time. The high P-value suggests that the association between function and missing is statistically weak.

We described how process problem F was found in Example 1 in Section 3.2. Let us go back to that description. Recall that the team found that all defects that were classified both function and missing pertained to a specific functionality, namely recovery, the ability of the system to recover from an erroneous state. Note that the classified data did not capture any information about that functionality. In other words, it was the hidden cause for the weak association between missing and function. Another way to look at it is that had the classified data captured information about the specific functions that were implicated by a defect, we would have seen an association between missing and recovery and another association between function and recovery. Those associations would have been more significant than the weak association between function and missing.

An informal way to understand this is as follows. An association may be viewed as an overlap between two categories. Since the classified data did not capture information about the recovery functionality, we did not see the strong overlaps between function and recovery and between missing and recovery. Instead, we saw only the indirect effect of those overlaps, namely, the weaker overlap between missing and function defects which had occurred as a consequence of the hidden overlaps. The recovery function was a hidden cause underlying the observed association.



TABLE 5

Statistical Significance Levels

Table   1.00     0.05    0.10    0.20     0.30     0.40     0.50     Size
1S      23,A,B   6,B     10,B    21,A,B   21,A,B   21,A,B   21,A,B
1P      47       3       5       8        14       18       29       30
2S      23       22      23      23       23       23       23
2P      33,C     7       17      23,C     27,C     33,C     33,C     74
1+2P    56,D     38,D    47,D    49,D     51,D     54,D     56,D     104
3S      28       27      28      28       28       28       28
3P      62       39      44      52       60       62       62       121
4S      35       28      30      33       33       33       35
4P      93,E,F   32      40      57       68       82,F     85,E,F   59
5S      39       39      39      39       39       39       39
5P      36,G     35,G    36,G    36,G     36,G     36,G     36,G     239
Sum     475      276     319     369      400      429      447
Hits    7        3       3       5        5        6        7
Rate    68       92      106     74       80       72       64



This is a good point to consider the difference between conventional hypothesis testing and hypothesis testing in the context of knowledge discovery. The identification of hidden causes is always a concern in statistical analysis. In the context of exploratory data analysis, that concern is exacerbated for the following reason. In conventional statistical analysis, a carefully designed experiment is used to test a given hypothesis. For instance, one may test the hypothesis: is smoking associated with cancer? A carefully designed experiment to collect and analyse data that screens out many hidden causes can be conducted to test that hypothesis.

In exploratory data analysis, there is not a single hypothesis that one is interested in. In fact, we do not know what hypotheses are interesting. But then, how can one guarantee that one is collecting the right information in the first place to test a given hypothesis in a rigorous fashion? The answer is one cannot.

It is easy to appreciate the above argument in the context of software development. Recall from Section 2 that Attribute Focusing is used to find process problems that the team does not know to look for. Well, since we don't know what we are going to find, how can we know we are capturing the right data to find it? Clearly, we cannot know. Process problem F is a good example. It shows that the data did not capture relevant information on specific functions such as recovery. More likely than not, if a complex process such as software development is being used to generate data, that data will not capture all the relevant information that is required to indicate a problem. Instead, it will capture some effect of the unknown problem, and it will be up to the team to identify the hidden cause. If statistical significance is used to eliminate items, those indirect effects can be eliminated as well, and consequently, the team will not identify the hidden causes that underlie such effects.

6.2 The advantage of highlighting information

Process problems C and E were physically significant but not statistically significant for the same reason. Let us use the identification of E as an example to understand that reason. The identification of process problem E occurred when the team considered the disassociation between India and backward compatibility which is presented in the second row of Table 3. The P-value for that disassociation (computed using Equation 6) was found to be 0.47. In other words, the random process model that was used to derive Equation 6 could have produced that disassociation or a stronger disassociation nearly 50% of the time, indicating that the lack of association was rather weak.

We described how that problem was found in Example 2 in Section 3.2. Let us go back to that description. Recall that the problem was found because the team realized that there should be a strong association between India and backward compatibility instead of the disassociation which was called to their attention.

Consider what would have happened if we had eliminated that disassociation from Table 3 based on its weak statistical significance. The team would have had difficulty realizing that there should be a strong association between India and backward compatibility, since there would be no entry in the table to call their attention to it. However, had they noticed the absence of the strong association, they would have reasoned exactly as they did to find the process problem. In other words, the information on statistical significance is entirely consistent with the line of reasoning used by the team to find the process problem. Omitting the entry simply makes it harder for the team to notice the problem, while retaining the entry serves to highlight the problem.

7 Lessons, discussion and conclusions

Let us return to the concerns that were summarized at the end of Section 4. The first concern had to do with saving the time spent on interpreting items that occurred by chance. If there were a way to clearly identify those items, such removal would indeed be a good idea. We believe such identification is hard when the data are generated by a complex physical process such as software development. Hence, we believe our approach developed using statistical significance is a reasonable and pragmatic way to try and address chance effects. As is clear from the reported experience, we should not try to save the team's time by limiting the number of items that are explored on the basis of a small P-value.

Well, then how should we proceed? The last three rows of Table 5 suggest a promising direction for further research. The row Sum is simply the sum of the items, totalled over all interpretation sessions, that would have been considered by the team if we used a particular significance level to select items. The row Hits indicates the number of process problems that would have been found, while Rate is Sum divided by Hits. Hence, Rate gives us an idea of the rate of discovery, i.e., the number of items interpreted to find a problem; for instance, at the level 1.00 it is 475/7, or about 68 items per problem. We see that had we used a significance level of 0.50, the rate of discovery would have been improved slightly (64 items per problem). However, if we used conventional wisdom and used 0.05 to select items, the rate would have been substantially worse (92 items per problem).
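The tabulation behind those rows is easy to mechanize. In the sketch below, items is a hypothetical list of (P-value, problem) pairs, one per interpreted item, with problem set to None for items that did not lead to a process problem; the function reports Sum, Hits and Rate at a given cutoff, as in Table 5.

    def discovery_rate(items, level):
        # Keep only items whose P-value is at or below the cutoff, then count
        # the retained items (Sum) and the distinct problems they reveal (Hits).
        kept = [(p, problem) for (p, problem) in items if p <= level]
        total = len(kept)
        hits = len({problem for (_, problem) in kept if problem is not None})
        rate = total / hits if hits else float("inf")
        return total, hits, rate

    # Table 5 at level 1.00: Sum 475, Hits 7, Rate 475/7 ~ 68; at the
    # conventional 0.05 cutoff the rate worsens to 276/3 = 92.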

The above argument suggests that while it is possible to improve the rate of discovery by using the significance level, it is also possible to reduce the rate substantially. More work is required to identify the right level that should be used. This direction is consistent with other work on data exploration. For instance, Hoschka and Klösgen [5] (see page 333 of the reference) talk about varying the significance levels with sample size. Larger significance levels are used with smaller samples and vice versa. The column Size in Table 5 indicates the sample sizes for the experience reported in this paper. We see that the smaller sample sizes, 30, 59, and 74, do indeed require the larger P-values to cover all problems. However, we also see that it is the sample of size 59, and not the sample of size 30, that requires the largest P-value to cover all problems, indicating that the relationship of sample size to P-value may not be a linear one. We plan to investigate such heuristics in the future. It is also plausible that the right significance level not only depends on the sample size but also varies from domain to domain. Hence, empirical studies are essential. For instance, we may find that a P-value of 0.50 almost always improves the rate of discovery for defect data, and hence, is a useful heuristic to select items in defect data. We plan to do more empirical studies to find such heuristics.

The results in Table 5 suggest that it may be necessary to explore statistically insignificant items. If statistically insignificant items are to be explored, then how do we address the remaining two concerns raised in Section 4?

The other two concerns that were raised in Section 4 can be addressed by using the information on statistical significance. One concern was that if a cause could not be determined even after an item was investigated, that item would not be considered further. Using the information on statistical significance, the team can decide if that should indeed be the case. If the item is statistically significant, they may decide to continue to investigate that item.

Another concern was that a sequence of errors on the part of the team could lead them to find a cause based on an item that had occurred purely by chance. The team can use the information on statistical significance to double-check their findings. If an item that was statistically insignificant was found to be physically significant, their findings should be consistent with the item being statistically insignificant. A hidden cause that was clearly corroborated, as in Example 1, or an item that was merely highlighted, as in Example 2 (Section 3.2), are examples of such consistency.

7.1 Limitation and strength

Let us make explicit the limitation and strength of the work that has been presented in this paper. They are presented in point-counterpoint fashion.

• Limitation - As is the case with empirical studies, strictly speaking, the conclusions are specific to the conditions of the experiment. Hence, the results presented here tell us primarily about the application of attribute focusing to one software project.

• Strength - While the results are indeed specific to one experiment, it is useful to learn from the underlying explanations since they may have general implications. Furthermore, there is not much available by way of field experience in the literature on knowledge discovery. Since the results are based on data from a fielded application, such an exercise is especially relevant.

Accordingly, (what we believe are) the general lessons to be drawn from this paper are listed below. The supporting evidence is indicated within parentheses.

7.2 General Implications

In the context of knowledge discovery:

1. In any given set of data, there are very few statistical hypotheses that will actually lead to the discovery of knowledge. (Note that Table 5 shows that only seven problems, based on seven relationships, were found from five sets of data. Field data from the application of attribute focusing to software development has shown that, on average, two to three such relationships are found per set of data [11, 17].)

2. The data are not collected to support testing the validity of a particular statistical hypothesis. (See the related discussion towards the second half of Section 6.1.)

3. Data will often come from processes that are poorly understood. It is hard to characterize the statistical properties of a complex data-generating process. Hence, statistical validation of hypotheses in data from such processes will usually be done by using general hypothesis-testing methods such as statistical significance; a minimal sketch of one such test appears after this list. (See the discussion at the beginning of Section 5. See also the work by Courtney and Gustafson [19] for another example of such use.)

4. Items 1, 2 and 3 above suggest that while the use of statistical hypothesis-testing methods may often be the only practical way to eliminate relationships that occur by chance, they will have to be used carefully to make sure that one does not also eliminate from a set of data the few relationships that are useful.
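For concreteness, one general-purpose test of the kind item 3 refers to is the exact binomial test of an observed category count against an expected proportion. The sketch below is our illustration, not the procedure used in the experiment; the counts are invented, and only the Python standard library is needed.

    from math import comb

    def upper_tail_pvalue(k, n, p0):
        """One-sided P-value P(X >= k) for X ~ Binomial(n, p0): the
        chance of observing a count at least as large as k if only
        random variation were at work."""
        return sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
                   for i in range(k, n + 1))

    # Invented example: 20 of 59 defects fall in a category whose
    # expected proportion is 0.25 (expected count about 14.75).
    print(upper_tail_pvalue(20, 59, 0.25))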

7.3 Conclusions

In the context of an exploratory data analysis technique such as Attribute Focusing, knowledge of statistical significance:

• Should be used carefully when selecting differences which should be studied to identify causes. Conventional wisdom would suggest that all differences that lie beyond some small level of statistical significance be eliminated from consideration. Our results show that such elimination is not a good idea.

• May be useful to improve the rate of discovery of causes during exploration. More work is required to understand how such improvement can occur.

• Can be used to address the concern that the items being explored have occurred by chance in two ways. First, if a cause is found while reasoning about a statistically insignificant item, that line of reasoning should be examined to ensure it is consistent with the item being statistically insignificant. Second, if a cause is not found for a statistically significant deviation, further investigation of the item should be considered prior to dismissing that item.



8 Acknowledgements

We would like to thank Jarir Chaar, Ram Chillarege, Mike Halliday, Art Nadas and Peter Santhanam, all of IBM, for the support and insights they have given us during the course of this work.

References

[1] G. Piatetsky-Shapiro and W. Frawley, eds., Knowledge Discovery in Databases. Menlo Park, California: AAAI Press/The MIT Press, 1991.

[2] C. Matheus, P. Chan, and G. Piatetsky-Shapiro, "Systems for knowledge discovery in databases," IEEE Transactions on Knowledge and Data Engineering, vol. 5, pp. 903-913, December 1993.

[3] B. Ostle and R. Mensing, Statistics in Research. Ames: Iowa State University Press, 1975.

[4] V. Dhar and A. Tuzhilin, "Abstract-driven pattern discovery in databases," IEEE Transactions on Knowledge and Data Engineering, vol. 5, December 1993.

[5] P. Hoschka and W. Klosgen, "A support system for interpreting statistical data," in Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds.), Menlo Park, California: AAAI Press/The MIT Press, 1991.

[6] W. S. Humphrey, Managing the Software Process. Addison-Wesley Publishing Company, 1989.

[7] R. A. Radice, N. K. Roth, A. C. O'Hara, and W. A. Ciarfella, "A Programming Process Architecture," IBM Systems Journal, vol. 24, no. 2, 1985.

[8] A. Endres, "An Analysis of Errors and Their Causes in System Programs," IEEE Transactions on Software Engineering, vol. 1, no. 2, pp. 140-149, 1975.

[9] V. R. Basili and H. D. Rombach, "Tailoring the Software Process to Project Goals and Environments," in Proc. of the International Conference on Software Engineering, pp. 345-357, April 1987.

[10] R. Chillarege, I. Bhandari, J. Chaar, M. Halliday, D. Moebus, B. K. Ray, and M. Y. Wong, "Orthogonal defect classification - a concept for in-process measurements," IEEE Transactions on Software Engineering, vol. 18, November 1992.

[11] I. Bhandari, "Attribute Focusing: Machine-Assisted Knowledge Discovery Applied to Software Production Process Control," Knowledge Acquisition. Accepted for publication. An abbreviated version appeared in Proceedings of the Workshop on Knowledge Discovery in Databases, AAAI Technical Report Series, Number WS-93-02, July 1993.

[12] R. W. Selby and A. A. Porter, "Learning from examples: Generation and evaluation of decision trees for software resource analysis," IEEE Transactions on Software Engineering, vol. 14, pp. 1743-1757, December 1988.

[13] L. C. Briand, V. R. Basili, and C. J. Hetmanski, "Developing Interpretable Models with Optimized Set Reduction for Identifying High Risk Software Components," IEEE Transactions on Software Engineering, November 1993. To appear in a special issue on Software Reliability.

[14] I. S. Bhandari and N. K. Roth, "Post-process Feedback with and without Attribute Focusing: A Comparative Evaluation," in Proc. of the International Conference on Software Engineering, pp. 89-98, May 1993.

[15] J. K. Chaar, M. J. Halliday, I. S. Bhandari, and R. Chillarege, "In-Process Metrics for Software Inspection and Test Evaluations," IEEE Transactions on Software Engineering, November 1993.

[16] I. Bhandari, M. Halliday, E. Tarver, D. Brown, J. Chaar, and R. Chillarege, "A case study of software process improvement during development," IEEE Transactions on Software Engineering, December 1993.

[17] I. Bhandari, M. Halliday, J. Chaar, R. Chillarege, K. Jones, J. Atkinson, C. Lepori-Costello, P. Jasper, E. Tarver, C. Carranza-Lewis, and M. Yonezawa, "In-process Improvement through Defect Data Interpretation," IBM Systems Journal, 1st Qtr 1994. To appear. A draft is available as IBM Research Report Log 81922, 1993.

[18] A. von Mayrhauser, Software Engineering: Methods and Management. San Diego: Academic Press, Inc., 1990.

[19] R. E. Courtney and D. A. Gustafson, "Shotgun Correlations in Software Metrics," Software Engineering Journal, January 1993.
