248 T. M. Khoshgoftaar and E. B. Allen

A project's developmental history can be captured by information systems. Many software development organizations have very large data bases for configuration management and for problem reporting which capture data on events during development. Such data bases are potential sources of new information relating software quality factors to the attributes of software products and the attributes of their development processes. For large legacy systems or product lines, the amount of available data can be overwhelming. The combination of numerous attributes of software products and processes, very large data bases designed for other purposes, and weak theoretical support [Kitchenham and Pfleeger (1996)] mandates an empirical approach to software quality prediction, rather than a strictly deductive approach [Khoshgoftaar et al. (2000)].

Fayyad (1996) defines knowledge discovery in data bases as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". Given a set of large data bases or a data warehouse, major steps of the knowledge discovery process are [Fayyad et al. (1996)]: (1) selection and sampling of data; (2) preprocessing and cleaning of data; (3) data reduction and transformation; (4) data mining; and (5) evaluation of knowledge. Fayyad restricts the term data mining to denote the step of extracting patterns or models from clean, transformed data, for example, fitting a model or finding a pattern. Classification-tree modeling is an acknowledged tool for data mining [Glymour et al. (1996), Hand (1998)].
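As an illustration only, the five steps can be sketched as a small pipeline. Everything in this sketch is an assumption made for the example and is not taken from the paper: the module records, the field names, and the one-threshold model standing in for the data mining step are all hypothetical.

```python
import random

# Hypothetical module records: (lines_of_code, changes, faults).
# None marks a missing measurement.
RAW = [
    (120, 2, 0), (950, 14, 3), (None, 5, 1), (300, 1, 0),
    (1500, 22, 5), (80, 0, 0), (700, 9, 2), (60, None, 0),
]

def select(records, k, seed=0):
    """Step 1: selection and sampling of data."""
    return random.Random(seed).sample(records, k)

def clean(records):
    """Step 2: preprocessing and cleaning (here: drop incomplete records)."""
    return [r for r in records if None not in r]

def transform(records):
    """Step 3: reduction and transformation
    (here: keep one predictor and reduce fault counts to a yes/no class)."""
    return [(loc, faults > 0) for loc, _changes, faults in records]

def mine(rows):
    """Step 4: data mining (here: fit a single threshold on lines of code
    that minimizes misclassifications; a stand-in for a real model)."""
    def errors(t):
        return sum((loc > t) != faulty for loc, faulty in rows)
    return min(sorted(loc for loc, _ in rows), key=errors)

def evaluate(rows, threshold):
    """Step 5: evaluation of knowledge (here: accuracy on the same data)."""
    hits = sum((loc > threshold) == faulty for loc, faulty in rows)
    return hits / len(rows)

rows = transform(clean(select(RAW, len(RAW))))
threshold = mine(rows)
accuracy = evaluate(rows, threshold)
```

In this sketch only `mine` corresponds to data mining in Fayyad's restricted sense; the other functions are the surrounding steps of the knowledge discovery process. Measuring accuracy on the data used for fitting is optimistic and is done here only to keep the example short.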

Knowledge discovery in general, and the data mining step in particular, is focused on finding patterns and models that can be interpreted as useful knowledge [Fayyad et al. (1996)]. Industrial software systems often have thousands of modules, and a large number of variables can be extracted from source code measurements, configuration management data, and problem reporting data. The result is a large amount of multidimensional data to be analyzed by the data mining step. Classification trees can be used as a data mining technique to identify significant and important relationships between faults and software product and process attributes [Khoshgoftaar et al. (1996a), Porter and Selby (1990), Troster and Tian (1995)].

This paper introduces the Classification And Regression Trees (CART) algorithm [Breiman et al. (1984)] to software engineering practitioners. A "classification tree" is an algorithm, depicted as a tree graph, that classifies an input object. Alternative classification techniques used in software quality modeling include discriminant analysis [Khoshgoftaar et al. (1996b)], the discriminative power technique [Schneidewind (1995)], logistic regression [Basili et al. (1996)], pattern recognition [Briand et al. (1992)], artificial neural networks [Khoshgoftaar and Lanning (1995)], and fuzzy classification [Ebert (1996)]. A classification tree differs from these in the way it models complex relationships between class membership and combinations of variables.
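To make the idea of a tree that classifies an input object concrete, the following minimal sketch walks a hand-built tree from the root to a leaf. The variable names, thresholds, and class labels are hypothetical and chosen only for illustration; they are not taken from the paper, and the tree is written by hand rather than built by CART.

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    label: str                 # class assigned to objects reaching this leaf

@dataclass
class Node:
    variable: str              # attribute tested at this decision node
    threshold: float
    low: "Node | Leaf"         # branch taken when value <= threshold
    high: "Node | Leaf"        # branch taken when value > threshold

def classify(tree, module):
    """Follow one path from the root to a leaf and return its class label."""
    while isinstance(tree, Node):
        value = module[tree.variable]
        tree = tree.low if value <= tree.threshold else tree.high
    return tree.label

# Hypothetical tree: a module is fault-prone only if it is both large
# and frequently changed, i.e. a combination of two variables.
TREE = Node(
    "lines_of_code", 500,
    Leaf("not fault-prone"),
    Node("prior_changes", 10, Leaf("not fault-prone"), Leaf("fault-prone")),
)

label = classify(TREE, {"lines_of_code": 1200, "prior_changes": 15})
```

Each root-to-leaf path tests a conjunction of conditions, which is how a tree can relate class membership to combinations of variables rather than to each variable separately.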

CART automatically builds a parsimonious tree by first building a maximal tree and then pruning it to an appropriate level of detail. CART is attractive because it emphasizes pruning to achieve robust models. Although Kitchenham briefly
