

  • Estimation of Software Defects Fix Effort Using Neural Networks

    Hui Zeng

    School of Information and Technologies

    George Mason University

    Fairfax, VA 22030

    [email protected]

    David Rine

    Department of Computer Science

    George Mason University

    Fairfax, VA 22030

    [email protected]

    Abstract

    Software defect fix effort is an important software development process metric that plays a critical role in software quality assurance. Practitioners usually apply parametric effort estimation techniques, using historical Lines of Code and Function Points data, to estimate the effort of defect fixes. However, these techniques are neither efficient nor effective for fixing defects in a new, different kind of project, where code will be written within the context of a different project or organization. In this paper, we present a solution for estimating software defect fix effort using self-organizing neural networks.

    1. Introduction

    Accurately estimating software defect fix effort within a software development organization can improve the management and control of the organization's software quality costs, the resources directed toward software quality, and software maintenance efforts. Several researchers have recently shown special interest in the estimation of defect fix effort [1][2]. Most general techniques for estimating software development effort are parametric project-size techniques, using measures such as Lines of Code (LOC) and Function Points (FP), that are based on historical data. However, these techniques do not perform well when they are used to estimate defect fix time [3]. The main reason is that defect fixes are not driven by counts of rewritten lines of code or function points within the application, but by counts of defects and the effort needed to fix them. There is no strong relationship between project size and defect fix time. For example, a hidden bug may require much more fix effort than ordinary, readily visible bugs. Moreover, because defects occur in various domains, it is not easy to use the FP approach to cluster all defects into the proper domains, and the number of defects in each domain leads to different defect fix times. Another limitation is that parametric techniques require adequate historical data, so they offer little help when estimating defect fix effort ahead of a new project that lacks such data.

    Neural networks (NNs), one category of non-parametric techniques, are often suggested for estimation with incomplete historical data [4]. Their advantage is that they do not require a detailed understanding of the input data; they are self-adaptive. The drawback is that NNs are not easy to interpret and fewer statistical techniques can be applied to them. For example, if the input data are categorized in loosely-structured free text [5], it is difficult for a neural network to produce an estimate.

    In this paper, we present a non-parametric estimation solution using neural networks that can handle some symbolic input data categorized in loosely-structured free text for defect fix effort estimation.

    2. Experiment Design and Results

    Our experiment on estimating defect fix effort is based on the NASA IV&V Facility Metrics Data Program (MDP) data repository [6]. The MDP static defect data report contains defect data that remain constant throughout the life cycle of each defect. The critical problem is that defect fix effort in MDP is recorded only per actual defect, not per type of defect. Moreover, there are no rigorous categories for these defects; they are only categorized in loosely-structured free text [5]. In this paper, we use 106 samples, corresponding to 15 different defect fix efforts, from the MDP dataset KC1 (after removing incomplete data) to assess estimation performance. KC1 is a metrics dataset from C++ development projects. Table 1 lists the input variables of the estimation.

    No.  Input Variable  Description                                               Types
    1    Fix_Hour        The actual number of man-hours the fix took to implement  Integer
    2    Severity        The severity of the defect                                1, 2, 3, 4, 5
    3    How_Found       The stage in which the defect was found                   Acceptance Test, Analysis, Customer Use, Engineering Test Inspection, Mission Critical, Planned Test, Regression Test, Release_I&T
    4    Mode            The mode the system was operating in                      DEV02, DEV03, DEV04, OPS, Other, TS1, TS2
    5    Problem_Type    Specific reason for closure of error report               Configuration, COTS/OS, Design, not a bug, source code
    6    SLOC_COUNT      The actual number of SLOC changed or added                Integer

    Table 1: Input Variables for Defect Fix Effort Estimation

    2.1 Feature Extraction

    The sixth variable, SLOC_COUNT, is an interval-valued variable whose value was normalized to between 0 and 1. The Manhattan distance was computed to generate its dissimilarity matrix between every two samples. The four nominal variables (Severity, How_Found, Mode, and Problem_Type) were converted to binary variables. A contingency table for binary data, shown in Fig. 1, was generated. An asymmetric dissimilarity was then produced based on the Jaccard coefficient shown in Eqn. 1.

    Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC'04) 0730-3157/04 $20.00 © 2004 IEEE

                          Sample j
                          1      0      sum
    Sample i    1         a      b      a+b
                0         c      d      c+d
                sum       a+c    b+d    p

    Figure 1. The contingency table for binary variables

    $$d(i, j) = \frac{b + c}{a + b + c} \qquad (1)$$
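As a rough illustration of the feature extraction described above, the following Python sketch expands nominal variables into binary indicators, counts the contingency cells a, b, and c, and evaluates Eqn. 1. It also shows the Manhattan distance on a normalized SLOC_COUNT. The helper names, the two sample records, the SLOC range, and the choice of min-max scaling are assumptions made for illustration; the paper only states that the value was normalized between 0 and 1. How_Found is left out of the category map to keep the example short.

```python
# Category values are taken from Table 1; the records below are invented.
CATEGORIES = {
    "Severity": [1, 2, 3, 4, 5],
    "Mode": ["DEV02", "DEV03", "DEV04", "OPS", "Other", "TS1", "TS2"],
    "Problem_Type": ["Configuration", "COTS/OS", "Design", "not a bug", "source code"],
}


def one_hot(record, categories):
    """Expand each nominal field into one binary indicator per category value."""
    bits = []
    for field, values in categories.items():
        bits.extend(1 if record[field] == value else 0 for value in values)
    return bits


def jaccard_dissimilarity(x, y):
    """Asymmetric binary dissimilarity d(i, j) = (b + c) / (a + b + c)."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)  # present in both
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)  # present only in i
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)  # present only in j
    if a + b + c == 0:
        return 0.0  # no indicator is set in either sample
    return (b + c) / (a + b + c)


def sloc_dissimilarity(sloc_i, sloc_j, sloc_min, sloc_max):
    """Manhattan distance between min-max normalized SLOC_COUNT values."""
    span = sloc_max - sloc_min
    if span == 0:
        return 0.0
    return abs(sloc_i - sloc_j) / span


# Two hypothetical defect records that differ only in Mode.
defect_i = {"Severity": 2, "Mode": "OPS", "Problem_Type": "Design"}
defect_j = {"Severity": 2, "Mode": "DEV02", "Problem_Type": "Design"}

d_nominal = jaccard_dissimilarity(one_hot(defect_i, CATEGORIES),
                                  one_hot(defect_j, CATEGORIES))
d_sloc = sloc_dissimilarity(10, 60, sloc_min=0, sloc_max=200)
print(d_nominal, d_sloc)
```

Here a = 2 (shared severity and problem type) and b = c = 1 (the differing modes), so the nominal dissimilarity is 2/4. Cell d (indicators absent in both samples) does not enter the formula, which is what makes the measure asymmetric.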

    For n samples, n(n-1)/2 pairwise dissimilarity vectors can be generated. In our experiment, two-thirds of the n = 106 samples were used for training a self-organizing neural network. The remaining one-third was reserved for testing the estimation performance of the neural network. Two dissimilarity attributes, one derived from the normalized SLOC_COUNT and one from the Jaccard coefficient over the four nominal variables, are used as network input.
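The pair construction and the two-thirds split described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each sample is represented as a tuple of a normalized SLOC value and a set of active binary indicators, the split is a seeded random shuffle, and all function names are hypothetical. The paper does not say how the partition was drawn, and it later mentions 68 clustered samples, so the exact split is not reproduced here.

```python
import random
from itertools import combinations


def two_attribute_dissimilarity(s, t):
    """Return (SLOC distance, Jaccard dissimilarity) for two samples.

    Each sample is (normalized_sloc, set_of_active_indicators). With sets,
    |symmetric difference| / |union| equals (b + c) / (a + b + c).
    """
    sloc_s, active_s = s
    sloc_t, active_t = t
    union = active_s | active_t
    jaccard = len(active_s ^ active_t) / len(union) if union else 0.0
    return (abs(sloc_s - sloc_t), jaccard)


def pairwise_vectors(samples, dissimilarity):
    """Build one dissimilarity vector per unordered pair (i, j) with i < j."""
    return [dissimilarity(samples[i], samples[j])
            for i, j in combinations(range(len(samples)), 2)]


def split_two_thirds(samples, seed=0):
    """Shuffle and split into roughly two-thirds training, one-third testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Three invented samples, giving 3 * 2 / 2 = 3 pairs.
toy = [
    (0.0, {"Severity=2", "Mode=OPS"}),
    (0.5, {"Severity=2", "Mode=DEV02"}),
    (1.0, {"Severity=3", "Mode=OPS"}),
]
vectors = pairwise_vectors(toy, two_attribute_dissimilarity)

# Pair count and split sizes for n = 106, using placeholder sample ids.
n_pairs = len(pairwise_vectors(list(range(106)), lambda s, t: (s, t)))
train, test = split_two_thirds(list(range(106)))
print(len(vectors), n_pairs, len(train), len(test))
```

For n = 106 the pair count is 106 × 105 / 2 = 5565, which is the number of two-attribute vectors available before any split.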

    2.2 Self-Organizing Maps (Kohonen Networks)

    The Kohonen network (Kohonen, 1990) [7] is an unsupervised network with the ability to self-organize. Among the architectures and algorithms suggested for artificial neural networks, the Self-Organizing Map (SOM) has the special property of effectively creating spatially organized internal representations of the various features of input signals and their abstractions. SOMs belong to a category of NNs in which neighboring cells compete in their activities by means of mutual lateral interactions and develop adaptively into specific detectors of different signal patterns. The spatial location, or coordinates, of a cell in the self-organizing map corresponds to a particular domain of input signal patterns.

    The training group is used to train the weights of the self-organizing NN. After the network was well trained, all 68 samples were grouped into clusters to form a feature map. The probability distribution of the Fix_Hour values within each cluster was then derived. The testing samples followed the same feature extraction procedure as the training samples, yielding a set of dissimilarity vectors for each sample. Each vector was fed into the trained self-organizing map, producing an unknown probability distribution. We then compared this unknown distribution against the previously found probability distribution to validate performance.
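To make the competitive, neighborhood-based training concrete, here is a minimal self-organizing map in plain Python. It is a generic textbook-style sketch, not the authors' network: the grid size, learning rate, Gaussian neighborhood, linear decay schedules, and the toy two-attribute inputs are all assumptions, since the paper does not report these settings.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid cell whose weight vector is closest to x."""
    return min(weights,
               key=lambda cell: sum((w - v) ** 2 for w, v in zip(weights[cell], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0          # initial neighborhood radius
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 1e-3    # radius shrinks, never zero
            bmu = best_matching_unit(weights, x)
            for cell, w in weights.items():
                grid_d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    # Winner and its grid neighbors move toward the input.
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Invented two-attribute dissimilarity vectors (SLOC distance, Jaccard).
toy = [[0.05, 0.10], [0.10, 0.05], [0.90, 0.95], [0.95, 0.90]]
som = train_som(toy, rows=1, cols=2, epochs=20)
clusters = [best_matching_unit(som, v) for v in toy]
print(clusters)
```

After training, the cell returned by `best_matching_unit` serves as the cluster label for a vector, which is how both training and testing vectors would be placed on the feature map in this sketch.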

    2.3 Probabilistic Measurement for Fix Effort

    After SOM training, the known values of defect fix effort, represented by the variable Fix_Hour, were assigned to the clusters found. The probability distribution of Fix_Hour within each cluster was computed. During the testing phase, each unseen sample was compared to all training sample vectors to generate 106 dissimilarity vectors. These vectors were fed into the already trained self-organizing neural network, which assigned the 106 vectors to their corresponding clusters. The probability of the Fix_Hour value can then be estimated.
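The per-cluster probability distribution of Fix_Hour can be sketched as a simple relative-frequency count. The Python below assumes that each clustered item carries a cluster label and a known Fix_Hour value; the function names, the toy labels, and the use of the mode and the expected value as point estimates are illustrative assumptions, because the paper does not state how a single estimate is read off the distribution.

```python
from collections import Counter, defaultdict


def fix_hour_distributions(cluster_ids, fix_hours):
    """Relative frequency of each Fix_Hour value within each cluster."""
    counts = defaultdict(Counter)
    for cluster, hours in zip(cluster_ids, fix_hours):
        counts[cluster][hours] += 1
    return {
        cluster: {hours: n / sum(counter.values()) for hours, n in counter.items()}
        for cluster, counter in counts.items()
    }


def estimate_fix_hour(distribution):
    """Return (most probable Fix_Hour, expected Fix_Hour) for one cluster."""
    mode = max(distribution, key=distribution.get)
    expected = sum(hours * p for hours, p in distribution.items())
    return mode, expected


# Invented cluster labels and Fix_Hour values for five training items.
cluster_ids = [0, 0, 0, 1, 1]
fix_hours = [2, 2, 4, 8, 16]
dists = fix_hour_distributions(cluster_ids, fix_hours)

# An unseen item mapped to cluster 1 inherits that cluster's distribution.
mode_1, expected_1 = estimate_fix_hour(dists[1])
print(dists, mode_1, expected_1)
```

A test vector that lands in a given cluster is then scored against that cluster's distribution, so the quality of the estimate depends on how homogeneous the Fix_Hour values inside each cluster are.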

    3. Performance Evaluation

    To evaluate the performance of our effort prediction model, we used the magnitude of relative error (MRE) as our evaluative measure [4]. Because the histogram of defect fix effort can be divided into 6 groups, we calculated the average MRE and the maximum MRE within each group. We also evaluated estimation performance using another NASA MDP dataset, KC3, as additional testing data. KC3 is a metrics dataset from Java development projects; 70 defect data samples from KC3 were used in the estimation. Using dataset KC1, the average MRE ranges from 7% to 23% and the maximum MRE from 23% to 83%, which indicates that the estimation performance of our method is robust, i.e. the average MRE is below the 30% norm for excellent effort estimation. However, when the 70 KC3 defect samples are used as testing data, the estimation result is poorer: the average MRE ranges from 40% to 159%, and the maximum MRE from 180% to 373%.
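For reference, MRE is conventionally defined as |actual − estimated| / actual. The short Python sketch below computes the average and maximum MRE for one group of estimates; the function names and the sample values are invented for illustration and are not data from the paper.

```python
def mre(actual, estimated):
    """Magnitude of relative error; actual must be non-zero."""
    return abs(actual - estimated) / actual


def summarize_mre(pairs):
    """Return (average MRE, maximum MRE) over (actual, estimated) pairs."""
    errors = [mre(actual, estimated) for actual, estimated in pairs]
    return sum(errors) / len(errors), max(errors)


# Invented (actual Fix_Hour, estimated Fix_Hour) pairs for one group.
group = [(10, 9), (4, 5), (8, 8)]
average_mre, maximum_mre = summarize_mre(group)
print(f"average MRE = {average_mre:.1%}, maximum MRE = {maximum_mre:.1%}")
```

Applying `summarize_mre` separately to each of the 6 groups would yield the kind of per-group average and maximum ranges reported above.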

    4. Conclusions

    We presented our solution for estimating software defect fix effort, which uses a dissimilarity matrix and self-organizing neural networks for defect clustering and effort prediction instead of existing project-size techniques; in our approach, defect fix effort (time) can be estimated from the number of defects in various domains. The experimental results indicate good performance when the method is applied to similar software development projects. However, performance is poorer when it is applied to defect fix effort estimation for software projects with totally different development environments. The estimation technique performs well only in family-oriented software development environments, such as product line development.

    5. References

    [1] A. Mockus, D. Weiss, and P. Zhang, "Understanding and Predicting Effort in Software Projects," 25th International Conference on Software Engineering, May 3-10, 2003.

    [2] S. Chulani, "Bayesian Analysis of Software Cost and Quality Models," Ph.D. Dissertation, Univ. of Southern California, 1999.

    [3] K. Manzoor, "A Practical Approach to Estim Defect-fix Time," http://homepages.com.pk/kashman/defectsEstimation.htm

    [4] M. R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, 1996.

    [5] T. Menzies and R. Lutz, "Better Analysis of Defect Data at NASA," 15th Int'l Conf. on Software Engineering and Knowledge Engineering, July 2003.

    [6] NASA Metrics Data Program Site. http://mdp.ivv.nasa.gov/

    [7] T. Kohonen, "The Self-Organizing Map," Proceedings of the IEEE, vol. 78, 1990, pp. 1464-1480.
