
Quality, Productivity, and Learning in Framework-Based Development: An Exploratory Case Study

Maurizio Morisio, Member, IEEE Computer Society, Daniele Romano, and Ioannis Stamelos, Member, IEEE

Abstract—This paper presents an empirical study in an industrial context on the production of software using a framework. Frameworks are semicomplete applications, usually implemented as a hierarchy of classes. The framework is developed first; then, several applications are derived from it. Frameworks are a reuse technique that supports the engineering of product lines. In the study, we compare quality (in the sense of rework effort) and productivity in traditional and framework-based software production. We observe that the latter is characterized by better productivity and quality, as well as a massive increase in productivity over time, which we attribute to the effect of learning the framework. Although we cannot extrapolate the results outside the local environment, enough evidence has been accumulated to stimulate future research work.

Index Terms—Application framework, framework, product line, process quality, software reuse, empirical study, learning.

1 INTRODUCTION

PROJECT managers and software engineers have often experienced a sense of frustration when a function, module, or application must be developed that is similar to previously developed ones. In fact, the concept of software reuse is nearly as old as software itself.

Earlier reuse models promoted the development of small-grained reusable components (for instance, procedures or classes) and their insertion in a reuse library. This model proved effective but limited because applications were able to reuse small-grained components only—reuse occurs at the code level.

Application frameworks are a more promising reuse model, which aims to reuse larger-grain components and high-level designs. Based on object-oriented technology, they are defined as “semicomplete applications that can be specialized to produce custom applications” [14]. Usually, a framework is made up of a hierarchy of several related classes.

With the function or class library model, one or a few isolated classes (or functions) are reused. With frameworks, reuse takes a completely different form, with much higher leverage. The application is built by slightly modifying the framework and by accepting in toto the framework’s high-level design. In other words, most of the framework is reused and reuse takes place both at the design level and at the code level. As part of this approach, the high-level design of the framework establishes the flow of control that has to be accepted by the application. This is called “inversion of control” since, in traditional development, the flow of control is defined by the application.
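Inversion of control can be sketched in a few lines of Java (the language of the framework studied here); the class and method names below are our illustrations, not the paper's:

```java
// Minimal sketch of inversion of control. All identifiers here
// (ServiceSession, onParticipantJoined, VideoConferenceSession)
// are assumptions for illustration, not the framework's API.
abstract class ServiceSession {
    // The framework, not the application, owns the flow of control:
    // run() decides when the application-supplied hook is invoked.
    public final String run() {
        return "session-start;" + onParticipantJoined("alice") + ";session-end";
    }

    // Extension point filled in by the derived application.
    protected abstract String onParticipantJoined(String user);
}

// The application is a "differential" artifact: only the hook is
// written, while the session's high-level design is accepted in toto.
class VideoConferenceSession extends ServiceSession {
    @Override
    protected String onParticipantJoined(String user) {
        return "bridge-attached:" + user;
    }
}

public class InversionOfControlDemo {
    public static void main(String[] args) {
        System.out.println(new VideoConferenceSession().run());
    }
}
```

Note that the application class never calls the framework's control logic; the framework calls the application, which is exactly the inversion described above.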

Framework-based development has two main processes: development of the framework and development of an application adapting the framework. Developing a framework is more demanding than building an application. The designer should have a deep understanding of the domain the application is embedded in and should anticipate the needs of future application developers. Domain analysis techniques, focused on modeling the scope of the domain and the commonalities and variabilities of applications, have been developed to guide the definition of framework requirements [7]. As a matter of fact, successful frameworks have evolved from reengineering long-lived legacy applications, abstracting from them the knowledge of principal software designers.

Developing an application from a framework (provided this is feasible, i.e., most of the requirements of the application are satisfied by the framework) is a differential activity: the framework is parameterized, modified, and extended.

The advantages promised by framework-based development—better reliability, shorter time to market, lower cost—depend on not writing software. However, the framework has to be developed first, and this means that several conditions have to be met:

. The commitment by management to commence and sustain investment in framework development.

. The stability of the domain on a time range long enough to have a positive return on investment.

876 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 28, NO. 9, SEPTEMBER 2002

. M. Morisio is with the Dipartimento Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy. E-mail: [email protected].

. D. Romano is with the Dipartimento Ingegneria Meccanica, Università di Cagliari, Piazza d’Armi, 09123 Cagliari, Italy. E-mail: [email protected].

. I. Stamelos is with Aristotle University, 54006 Thessaloniki, Greece. E-mail: [email protected].

Manuscript received 21 Aug. 2000; revised 26 July 2001; accepted 30 Jan. 2002. Recommended for acceptance by D. Batory. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 112739.

0098-5589/02/$17.00 © 2002 IEEE


. The availability of highly skilled software designers and domain experts to build the framework.

Once the framework is available, developing an application from it means saving considerable time and effort compared to developing the application from scratch. However,

. Software engineers have to be trained to use the framework.

. Highly skilled designers have to maintain the framework.

Currently, there is very little quantitative evidence to support project managers in decisions about framework-based development. What are, in quantitative terms, the gains in productivity, quality, and time to market when using a framework? How many applications have to be developed before the investment in developing the framework is paid off and the break-even point is reached? How quickly can a programmer master a framework he/she has not developed?
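As a back-of-the-envelope illustration of the break-even question (this model is our assumption, not the paper's): if $C_f$ is the cost of building the framework, $C_t$ the average cost of a traditionally developed application, and $C_a$ the average cost of an application derived from the framework, the investment pays off after

```latex
% Illustrative break-even count (assumed model, not from the paper)
N^{*} = \left\lceil \frac{C_f}{C_t - C_a} \right\rceil, \qquad C_t > C_a
```

applications. Economic models for reuse, such as those cited later in this paper, refine this by discounting future savings and including framework maintenance costs.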

The contribution of this paper is in the quantitative analysis of productivity and quality in framework-based development, in comparison with traditional development.

Specifically, we planned a study where a single programmer, a novice at the framework, produced five applications fully based on a framework and four applications based on traditional component development. For each application, we measure the effort to develop it, the effort to correct defects, size, complexity, and reuse level; we define two indices for the productivity and quality of the development process and observe their mean figures and trend over time, also providing empirical predictive models. We have found that productivity and quality are substantially higher in framework-based development than in traditional development. We also report gains in productivity and quality over time in both development modes. We attribute this effect to the fact that the programmer learns more and more about the application domain and the tools he uses. In particular, productivity improves in an impressive way in framework-based development.
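The two indices are defined formally later in the paper; as a rough sketch of how such process indices can be computed from the measured quantities (the formulas, units, and figures below are our assumptions, not the paper's definitions):

```java
// Illustrative process indices; units (OOFP, person-hours), formulas,
// and the sample figures are assumptions for this sketch.
public class ProcessIndices {
    // Productivity: net size produced per unit of development effort.
    static double productivity(double netSizeOofp, double devEffort) {
        return netSizeOofp / devEffort;
    }

    // Rework density: correction effort per unit of net size
    // (lower is better; its inverse can serve as a quality index).
    static double reworkDensity(double correctionEffort, double netSizeOofp) {
        return correctionEffort / netSizeOofp;
    }

    public static void main(String[] args) {
        // Made-up figures for one application in each development mode.
        System.out.println(productivity(120.0, 40.0));  // framework-based
        System.out.println(productivity(150.0, 100.0)); // traditional
        System.out.println(reworkDensity(5.0, 120.0));
    }
}
```

Observing the two indices per application, in development order, is what allows both the mean comparison and the trend-over-time (learning) analysis described above.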

Ideally, such a case study should have been performed involving several randomly selected programmers to account for variability within the programmer population. This study, however, was made in an industrial environment: The resources needed for observing a sample of programmers over a time long enough to assess the programmer’s learning were not available. The results of our paper are based on the performance of a single programmer. Of course, this is no guarantee that the results of our study generalize to larger groups. On the other hand, a merit of exploratory studies like this one is to stimulate novel research hypotheses, which may then be validated by more thorough and extensive investigations.

The paper is organized as follows: Section 2 describes the context of the study (the framework and the development processes used). Section 3 describes the empirical study, including goals and hypotheses, the nature and criteria of the experimental design selected, validity threats, and the analysis and interpretation of results. A discussion in Section 4 concludes the paper.

1.1 Related Work

We are aware of very few empirical studies regarding framework-based development. Mattsson [20] compares the effort to develop the framework and the effort to develop an application from it. Thirty-one applications were developed from four subsequent versions of the framework, and the application development effort using the framework was lower than 2.5 percent of the effort to develop the framework itself in more than 75 percent of the cases. He concludes that framework technology provides reduced effort for application development.

Moser and Nierstrasz [23] propose the System Meter approach to estimate the effort involved in software projects and compare it with Function Points on a set of 36 projects. A subset of these projects developed and used frameworks. The authors analyze the productivity for this subset and observe that productivity does not change remarkably if the size of the framework is omitted and only newly developed code is considered. This is the same philosophy of measuring size and productivity that we used in our study. However, our result is different, as net productivity (after learning) is higher for framework-based development than for traditional development.

Shull et al. [30] study the process of using and learning a framework to develop graphical user interfaces by students in an academic setting. They propose that techniques based on examples are the most suitable for supporting learning, especially for novice users.

In other empirical studies not related to frameworks [29], [25], [26], the authors consider learning as a perturbing factor and design the experiment to neutralize its effect.

Although the general problem of deciding whether to develop a framework or not should be addressed with the aid of economic models, we are not aware of models developed specifically for frameworks. However, existing economic models for reuse [28], [8] could be adapted.

Srinivasan [31] points out that frameworks require a steep learning curve on the part of the reuser, something which has to be considered as another form of investment.

2 THE FRAMEWORK AND THE APPLICATION-DEVELOPMENT PROCESS

The network division of a research and development company has identified many domains and subdomains where the commonality between applications is high and, therefore, reuse is a potential choice. Among the several reuse techniques which have been tried, frameworks have proved to be a very promising and feasible option.

The division decided to set up a study on framework-based development to obtain quantitative insight into the process. Specifically, the goal was to build models (cost, return on investment, reliability, learning curve) in order to guide technical and business choices in the years to come.

In the following, we describe the framework, the application-development process based on it, and the applications developed during the study.

2.1 The Framework

The framework supports the development of multimedia services on a digital network. It uses a CORBA infrastructure, is developed in Java, and integrates some COTS (Commercial Off-The-Shelf) products.

The framework (see Fig. 1) is composed of two layers. The lower layer, composed of the service platform and special resources, offers network functionalities through CORBA APIs. The higher layer, organized in components, abstracts control services.

The service platform offers stable, basic functions for service access (request to activate a service session, request to join a session), management of profiles of service subscribers, session control (coordination of the resources used in a session), and network resource control.

Special resources offer specific functions (e.g., reflector [receives multimedia data packets from several sources and resends them to a set of destinations], video bridge [abstracts a hardware bridge for video data], vocal gateway [abstracts a hardware gateway for audio data])—each one with a dedicated API. This design allows the addition of special resources to the framework when needed, without changes to the service platform.

The components in the higher layer, implemented as several Java classes, belong to these families:

. UAP (User Application): UAPs are executed on the terminals of the service’s participants; they implement the connection between terminal and network services. Given a service session, several UAPs can be activated, one for each role played in the service.

. GSS (Global Service Session): GSSs contain the specific network logic (e.g., access and roles of participants, coordination of special resources) required by a service. Given a service, a single GSS is activated on the server of the service retailer.

. SP (Service Profile): SPs describe each participant in a service session. At least one SP for each role is activated for a session; they all reside at the server of the service retailer and are persistent.
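As a sketch of how the three families might look in Java, with white-box and black-box reuse side by side (all identifiers below are illustrative assumptions, not the framework's actual API):

```java
// Hypothetical shapes of the three component families; every name
// here is an illustration, not taken from the framework's documentation.
interface UserApplication {                    // UAP: runs on a participant's terminal
    String connect(String terminalId);
}

abstract class GlobalServiceSession {          // GSS: white-box, one per service
    public final String start() {              // framework-controlled entry point
        return "gss-started:" + serviceLogic();
    }
    protected abstract String serviceLogic();  // service-specific network logic
}

final class ServiceProfile {                   // SP: black-box, at least one per role
    private final String role;
    ServiceProfile(String role) { this.role = role; }
    String role() { return role; }
}

public class ComponentFamiliesDemo {
    public static void main(String[] args) {
        // GSS is specialized by subclassing (white-box reuse)...
        GlobalServiceSession gss = new GlobalServiceSession() {
            @Override protected String serviceLogic() { return "multiparty-audio"; }
        };
        System.out.println(gss.start());
        // ...while SP is simply parameterized (black-box reuse).
        System.out.println(new ServiceProfile("chair").role());
    }
}
```

The split mirrors the reuse modes described next: only the GSS family requires the application developer to write new subclass code.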

The framework is designed to be reused black-box [9], except for GSSs, which have to be specialized from the base class (white-box). The framework is composed of 22 classes, or around 10 KSLOC (source lines of code), and was developed before this study. A 30-page document describes the framework and is available to its users. The document contains a description of the components (UAP, GSS, and SP), how they work together, and how they should be used to develop a service. In addition, an example of a service derived from the framework is provided. The formalisms used in the document are natural language, message sequence charts, hierarchy diagrams, and composition diagrams.

2.2 The Application-Development Process

As already mentioned, the application framework was developed before the present study, which deals exclusively with using the framework. Since framework-based application development is different from traditional application development specifically in the process used, the activities involved when employing the framework are listed below.

. Requirements definition. The requirements of an application are specified by means of a Use Case format. Tool used: MS Word.

. Analysis. Starting from the requirements and the Use Cases, the analyst defines how to implement the application using the framework. In the simplest case, this means selecting which component from the families UAP, GSS, and SP should be reused and integrated into the framework. In most cases, this will not be sufficient, so new components will have to be identified, specified, and developed. Tool used: MS Word.

. Component development. This activity can occur only if the previous one has identified new components to be developed. As already mentioned, a component is a set of Java classes. Starting from text specifications developed in the activity Analysis, the component is designed, an IDL interface is specified, classes are coded and tested, and the whole component is integrated and tested. In actual fact, this activity consists of a traditional development process. The process is independent of the framework: the developer receives the component specifications and develops the required Java classes by abstracting from the framework. Tool used: Java development environment.

. Application development. The framework is parameterized, after which the components identified in the activity Analysis, and possibly developed in Component development, are parameterized and integrated into the framework. The application is then tested informally. Tool used: Java development environment.

. Acceptance test. Test cases are generated from the use cases in the requirements document and subsequently these test cases are applied. Tool used: Java runtime environment, Orbix-web.

. Usage. The application is used in the field and failures are logged in failure reports. Tool used: Java runtime environment, Orbix-web.

. Corrective maintenance. Failures reported from usage are repaired. Tool used: Java development environment.

3 THE EMPIRICAL STUDY

The description of the present study follows as far as possible a general template suggested by Wohlin et al. [32], which summarizes the manner in which most empirical studies in software engineering are presented: definition, planning, validity, and statistical analysis of results. Each topic contains, in turn, a number of issues. The first section, Definition, states the goal of the study (object and objectives), discusses its nature (case study versus experiment), and describes the context (the programmer and his environment). Planning comprises the formulation of the research hypotheses, the selection of independent and dependent variables with a justification of the metrics adopted to express them, the plan (the sequence of developed applications plus their description) to be adopted, and its design criteria. Validity discusses construct, internal, and external validity. In Statistical analysis, the statistical analysis performed on the results (descriptive statistics, graphical representation, regression models) is described.

Fig. 1. Framework structure.

3.1 Definition

3.1.1 Goal of the Study

A general statement of the goal of our study is: To analyze the development process of framework-based applications with respect to development without the framework for the purpose of evaluating productivity, quality, and learning effect in the context of web-based multimedia services developed in an industrial setting by programmers who were not the framework developers.

3.1.2 Nature and Context of the Study

The study described here is very much a compromise between a controlled experiment and a case study. The user was well aware of possible limitations and validity threats (see Section 3.3) because of his decision to perform the study inside his own company under a rather severe budget limit and within a short time frame. Since the purpose of the present research is to assess the performance of developers who have little or no prior knowledge of the framework, an improvement in productivity and quality over time is expected and, therefore, investigating the learning effect is a primary objective of the study. This is not a purely academic problem, since a steep learning curve can be a decisive factor in making framework technology profitable, as already documented elsewhere [31]. Given the time and cost constraints of our situation, we decided to use only one programmer and observe him during the development of a number of applications (nine). The underlying rationale is this: whenever significant changes over time occur in individuals while they are receiving different treatments, it can be more informative to study one subject for several hours than several subjects for one hour [17]. Therefore, we favored control over the development-mode factor and investigation into the learning effect, sacrificing generalizations regarding the programmer population. Furthermore, this is a typical situation since, in companies, programmers who develop and maintain applications based on a framework tend to do this many times. We chose a computer science graduate with a good knowledge of the application domain and expertise in object-oriented programming, a beginner in the language used (Java), and a novice to the framework. We will discuss this issue further in the forthcoming Validity section.

The developed applications belong either to a control group (development without the framework, corresponding to the activity Component Development in the process described in Section 2.2) or to an experimental group (development with the framework, corresponding to all other activities in the process in Section 2.2). They have all been developed through the same process, described in Section 2.2. However, for those in the control group, the effort spent on the activity Component Development is predominant—accounting for the large majority of the effort spent. Such applications can be considered as variants of an application previously developed. Here, the requirements and design are basically the same as for a previous application, but the new functionality requires the development of various new classes, which can be accomplished through the activity Component Development. Component Development is basically a traditional development process, where code is developed and tested starting from certain specifications, with no reference to the framework. We verified this point by specifically interviewing the programmer during and after the development of the applications in order to ensure that component development occurred without any reference to or use of the framework. For these reasons, we have used applications where Component Development is the predominant activity as a control group. The characteristics of the two subgroups are summarized in Table 1.

The present study can be defined as a single-subject (or single-case) experiment according to Harrison [12], who refers to a vast practice of such experiments in social science and medicine and promotes their use in software engineering. The present study could also be denoted as a multiobject variation study, adopting the classification proposed by Basili et al. [3] and Wohlin et al. [32]. The latter argue that our type of study is a quasi-experiment because the subject is fixed instead of being randomly selected across the projects, as should be the case in a canonical experimental setup. However, the software engineering community is somewhat reluctant to call an empirical study that involves only one developer an “experiment” (even if it includes control factors), since there is generally considered to be a wide variation in performance among programmers. Therefore, we call our study an exploratory case study with a single subject.

3.2 Planning

3.2.1 Research Hypotheses

The research goal is deployed in the following four hypotheses to be tested.

H1. Development with framework provides higher net productivity than development without framework.

This statement should not be read in a trivial sense, i.e., that reusing a framework shortens the time needed to produce an application. Of course it does, because a smaller amount of code needs to be written thanks to reuse. This is the basic assumption upon which the convenience of framework-based development lies (however, in certain contexts, increased reliability might justify the use of a framework even with a lower gross productivity). The hypothesis to be verified is whether, given the same amount of time, the programmer is able to write a larger piece of code if he uses the framework. In this sense, the hypothesis is not trivial, as it assumes that a framework enables the programmer not only to write less code but also to do so more quickly. Therefore, net size is considered, namely, the amount of additional code actually written in order to customize the framework and turn it into a specific application.

H2. Learning increases net productivity in development with framework more than in development without framework.

We expect an increase in productivity over time, no matter which development mode is considered. Here, we are testing whether the productivity gain is faster in framework-based development than in traditional development. This would imply a competitive advantage in the long run, after the initial expenditure for framework development is paid back.
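One conventional way to make H2 testable, assumed here for illustration rather than taken from the paper, is a power-law learning curve in which productivity on the $i$th application grows with cumulated output:

```latex
% Power-law learning curve (a standard assumed form, not the paper's model).
% P_i: net productivity on the i-th application; L_i: cumulated net size.
P_i = P_1 \, L_i^{\,b}, \qquad b > 0
```

Under a form like this, H2 amounts to estimating the learning exponent $b$ separately in each development mode and checking whether it is larger with the framework.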

H3. Development with framework is less prone to failures than traditional development.

In the same fashion as for H1, this is meant in a nontrivial sense. It is quite obvious that framework-based development, as for any reuse technique, means writing less, and reusing more documents and code, than with traditional development. Since framework modules are reused and field-tested in several applications, we expect that—overall—the failure density of the application is lower. However, the hypothesis means that, if we compare two pieces of code of the same size developed (not reused) in the two modes, we find fewer errors in the framework mode. The rationale here is very close to that of H1, for it is reasonable to assume that easier-to-write code will contain fewer errors.

H4. Learning reduces failure occurrence in development with framework more than in development without framework.

In general, we expect that failure density should decrease over time thanks to a learning effect. The hypothesis assumes that, in framework-based development, failure density decreases over time more quickly than in development from scratch. The rationale is the same as for H2.

3.2.2 Selected Variables and Metrics

The developed applications are characterized by a set of attributes, which are the independent variables of the study. They are development mode, application domain, programming language, application size, cumulated size (from the first up to the ith application), complexity, and reuse level of the framework. Some of these variables are controlled; some are not. The controlled variables are development mode, application domain, and programming language. The development mode is an experimental factor with two levels (with/without framework). Application domain (multimedia networking services) and programming language (Java) are fixed in the present study. Admittedly, fixed variables confine the scope of the investigation, but they do not introduce possible collinearities with any other independent variables, which would be responsible for confounding of effects. Uncontrolled variables (also called covariates) are individual size, cumulated size, and the complexity and reuse level of the framework. Being measured a posteriori, their effect on performance can also be analyzed. In the next paragraph, the factor, the covariates, and their metrics are defined. Table 2 summarizes the variables, adding mathematical definitions and units.

. Development mode, F. It is a two-level qualitative factor. Levels are development without the framework, coded with 0, and development with the framework, coded with 1.

. Net size of an application, S. It is a covariate because its value is unknown before development. It considers only newly written code, thus excluding reused code. It is expressed in Object-Oriented Function Points (OOFP [2]), which compute the functionality starting from specific classes and methods. As plausible metrics, initially, we also considered Function Points (FP [1], [13]), System Meter [23], and the classical Source Lines Of Code (SLOC). We excluded FPs because counting rules were not appropriate for Web-based object-oriented applications and because they do not allow the distinction between functionality provided by the framework and that provided by the code developed for the specific application. This leads to the conclusion that Function Points cannot be used to measure the net size of an application developed, nor, as a consequence, the reuse level [27]. We excluded SLOCs because they have been heavily criticized for causing practical difficulties when used to measure productivity [15]. We preferred OOFPs to the System Meter mentioned above because OOFPs are more recent, follow the concepts of the well-known FP, and are more straightforward to apply. The reader may refer to [22] for more details about measurement activities and for a discussion regarding measurement of the reuse level.

880 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 28, NO. 9, SEPTEMBER 2002

TABLE 1. Characteristics of Applications in Experimental and Control Groups

. Cumulated net size, L. It is a covariate that sums up the net size of the applications developed up to the current one. It is introduced to express the effect on performance of the programmer's learning over time. See Construct Validity later on for a discussion of its ability to produce the desired effect.

. Complexity, C. It is a covariate measured after development as the number of methods per unit of net size. It was inspired by the WMC measure in Chidamber and Kemerer's [6] metric suite. WMC considers each method as a complexity unit at the level of a class. In our study, we consider each method as a complexity unit at the level of the entire application.

. Reuse level, RL. We use the traditional definition of reuse level [27], i.e., the ratio between the size of what is reused from the framework and the total delivered size in an individual application. Ideally, the framework is reused in toto by each application. In actual fact, each application reuses part of the framework but not all of it. The difficulty is that measuring the exact amount of framework reused by each application is impractical. We introduce the utilization factor U_FWK, with range [0, 1] (U_FWK = 1 means the framework is reused in toto), into the definition of the reuse level. Normally, it should be a covariate, measured ex post for development with the framework (being zero for development without the framework). However, in our case, the reuse level varies minimally within framework-based applications (from 77 to 85 percent, assuming U_FWK = 1) and, therefore, it is practically an alias of the development mode factor. For this reason, it is excluded from the analysis.
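As a sketch of how these definitions combine, the reuse level can be computed from the net (newly written) size, the framework size, and the utilization factor. The helper below is a hypothetical illustration, not the authors' measurement tool; it assumes the reading RL = U_FWK·S_FWK / (U_FWK·S_FWK + S), where S is the net size.

```python
def reuse_level(net_size_oofp: float, framework_size_oofp: float,
                u_fwk: float = 1.0) -> float:
    """Reuse level RL: reused framework size over total delivered size.

    Assumed reading of the paper's definition: the framework contributes
    u_fwk * framework_size_oofp to the delivered size; the application
    adds net_size_oofp of newly written code.
    """
    reused = u_fwk * framework_size_oofp
    return reused / (reused + net_size_oofp)

# Example: a framework of 1000 OOFP fully reused (u_fwk = 1) plus
# 250 OOFP of new code yields RL = 0.8, inside the paper's 77-85% band.
print(reuse_level(250, 1000))
```

The sizes used are illustrative, not the study's actual figures.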

MORISIO ET AL.: QUALITY, PRODUCTIVITY, AND LEARNING IN FRAMEWORK-BASED DEVELOPMENT: AN EXPLORATORY CASE STUDY 881

TABLE 2. Factors, Covariates, and Responses in the Experiment


Two dependent variables, the responses, are of interest in the present study (see Table 2 for more details).

. Productivity, p. It is calculated as the net size of an application divided by the effort required for its completion. Please notice that we use net size in the numerator, so productivity should actually be called net productivity; we omit the adjective merely for the sake of simplicity. The development effort is measured by the programmer, who logs it on a daily basis.

. An index of quality of programming, q. It is defined as the relative deviation between the development effort and the rework effort required to correct code failures identified by the acceptance test (its range is 0 to 1; 1 means no failure encountered, 0 means that rework effort for correction equals development effort).

For ease of interpretation, both responses are defined so that larger values are better.
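Reading the two definitions literally, the responses can be sketched as follows. This is a minimal illustration, assuming q = (development effort − rework effort) / development effort, which matches the stated endpoints (q = 1 with no rework, q = 0 when rework equals development effort):

```python
def productivity(net_size_oofp: float, dev_effort_hours: float) -> float:
    """Net productivity p: newly written OOFP per hour of development effort."""
    return net_size_oofp / dev_effort_hours

def quality_index(dev_effort_hours: float, rework_effort_hours: float) -> float:
    """Quality index q in [0, 1]: 1 = no failures, 0 = rework equals development."""
    return (dev_effort_hours - rework_effort_hours) / dev_effort_hours

# Illustrative figures: 120 OOFP written in 40 hours, with 4 hours of
# rework after the acceptance test.
print(productivity(120, 40))   # 3.0 OOFP per hour
print(quality_index(40, 4))    # 0.9
```

The effort figures are invented for illustration; the study's actual measures appear in Table 4.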

3.2.3 Experimental Plan and Design Criteria

The overall experimental plan is depicted in Table 3. Applications in the control group appear in the far right column, the others in the middle column. Run order increases from the top down. The time schedule and the effort figures of the developed applications are reported in Fig. 2. The whole development process was completed in a six-month period with a one-month vacation interruption. The programmer worked nearly full time on this project.

A brief description of each application, in the same order as developed, is given below for interested readers.

1. SPY—Spy Camera. This application permits the monitoring of a number of video cameras on different sites. The observer starts the application, specifying the sites where the cameras are located and then connecting/disconnecting with them. The site must accept connection or disconnection by the observer.

2. TLL—Telelearning. This application permits the setting up of a distance-learning session. The roles involved are teacher and students. The teacher starts and ends a lesson. A student can join/leave a lesson upon permission from the teacher. The teacher can decide to test a student (in this case, all other students are simply observers) and to exclude a student from the session. In addition, a student may privately pose questions to the teacher.

3. SPY—Multispy camera, multiobserver. Variant of Spy Camera. Same as Spy Camera, but there may be several observers during a single session.

4. MCS—Multiconference. Audio and video multiconference on the Internet. The roles are chairman and participant. All roles can see and hear the other participants. The chairman starts and ends the session, calls participants to join, and accepts or refuses requests to leave from participants. The chairman moderates the session by giving the floor to one participant at a time. He is allowed to pass his role to another participant.

5. AUC—Auction. This application offers an Internet-based auction. The roles are auctioneer and bidder. The auctioneer starts and ends the session. He auctions one item at a time and regulates bids, assigning the item to the best bid, or possibly retiring it from the auction. A bidder can exit the session only if his bid is not the highest one at a certain moment.

6. AUC Browser—auction with browser—variant of Auction. The only difference is that when the first item is auctioned, an Internet browser is started on each bidder's terminal, open at the URL where the item is described.

7. MCS Mail—variant of Multiconference. In MCS, if the chairman calls a participant and cannot reach him, nothing further occurs. In this variant, an e-mail is sent to the participant to notify him of the unsuccessful attempt to reach him.

8. MCS Sms—variant of Multiconference. An SMS message is sent when the participant cannot be reached by the chairman.

9. VOD—Video on demand. The roles are customer and provider. The customer requests commencement of a session, selects a movie from a catalog, and pays. The provider checks these operations; when they are completed successfully, he starts the movie for the customer. The provider ends the session when the movie ends; the customer can pause/resume the session, or stop it.

The time sequence of the applications is a compromise between the randomization principle commonly used in Design of Experiments and the practical needs of the study. A randomized run order is normally adopted, as it ensures that time order neither masks existing factor effects nor erroneously reveals nonexistent ones [4]. In our case, this precaution is even more vital since we are interested in investigating the learning effect. For example, had we adopted the sequence 111110000 (five applications with framework, followed by four without), the effect of factor F (development mode) would have been confounded with the learning effect because the last four developments (F = 0) would have benefited from greater learning than the five initial ones (F = 1). However, a complete randomization could not be applied in our study for two reasons: First, we had the obvious constraint that variants had to be developed after their parental applications. Second, we deliberately planned to develop the two most similar framework-based applications (SPY and VOD) at the beginning and at the end of the experiment. In this way, the estimate of the learning effect in framework-based development (one of our major concerns) should have increased precision because it exploits the largest possible leverage of learning (likewise, a larger range of the independent variable x increases the precision of the estimated slope of y(x) when fitting a straight line to data using simple linear regression).

TABLE 3. Experimental Plan

Fig. 2. Time schedule of the developed applications, with framework (white boxes) and without framework (gray boxes). Effort figures, in hours, are marked inside the boxes. Month 3 was a vacation period. Note that for applications without framework the figures refer only to component development.

We did not include replicated runs in the experiment, although they are quite valuable in providing an external estimate of the random scatter of responses. We believe that replicating the development of an application may be counterproductive in the case of a single programmer. In fact, the second time a programmer develops an application, he is faster not only because of learning, but also because he remembers the design and code already developed, and this would positively bias his performance.

3.3 Validity

We discuss here the validity of the experiment. In the same manner as [16], [32], we consider construct validity, internal validity, and external validity.

3.3.1 Construct Validity

Construct validity aims at assuring that the metrics used in the study reflect real-world entities and attributes reliably. The key attributes here are productivity, quality, and learning. The measures selected for size (OOFP) and complexity (number of methods per unit size) have already been justified.

For productivity, we use net productivity, i.e., the ratio of the size of software newly written for an application to the effort to develop the application. Another option is to use gross size (size of the framework + net size) in the numerator, the rationale being that an application delivers the whole functionality in the framework, and not only what is newly written. Symmetrically, the denominator should consider the effort required to develop the framework. Since, however, the framework is reused in several different applications—with different reuse levels—it is not clear how to distribute among them the effort expended in developing the framework. What is more, the framework and the applications are developed by different programmers, so it would not be correct to mix their productivity levels. For these reasons, we use net productivity. Naturally, the investment needed to develop the framework should also be taken into account in a comparative economic evaluation.

For quality, we use an index of rework effort. Although using the number of defects could have been an alternative, we chose rework effort because it can be neatly integrated into the overall effort figure, thus providing a consistent and complete base on which to build a cost model for break-even point prediction.

The metric used to capture the learning effect is a delicate issue. We measured the learning effect on performance (productivity or quality) at a given time as the improvement of that performance (with respect to the beginning of the development) due solely to the fact that our programmer had already developed a certain size of code previously. To understand our approach, it is useful to introduce here the full model used to describe the experimental responses:

y = β0 + β1 F + β2 L + β12 FL + β3 S + β4 C + ε,   (1)

where y is either the productivity or the quality index, the coefficients β are the parameters to be statistically estimated from the data, and ε is the experimental error, assumed to be normally distributed with zero mean. The learning effect on y up to the total developed size L is:

β2 L + β12 FL = (β2 + β12 F) L,

which is β2 L for nonframework-based development and (β2 + β12) L for framework-based development. In reality, there are three components of learning, namely, learning the programming language, the application domain, and the framework usage. Learning the programming language and the application domain occurs in both development modes, while improvement in use of the framework is obviously specific to framework-based development. Therefore, we can legitimately associate the common term β2 L with nonspecific learning and the differential term β12 L with framework-specific learning. That is why it is vital to include the interaction between L and F in the model.
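The decomposition above can be sketched numerically. The function below is an illustrative aid, not part of the study's tooling; plugging in the coefficient estimates reported later for model (2) shows how the framework-specific term dominates:

```python
def learning_effect(beta2: float, beta12: float, f: int, cum_size: float) -> float:
    """Learning effect (beta2 + beta12 * F) * L on the response.

    beta2  : nonspecific learning rate (language + domain), active in both modes.
    beta12 : framework-specific learning rate, active only when f = 1.
    f      : development mode (0 = without framework, 1 = with framework).
    """
    return (beta2 + beta12 * f) * cum_size

# Using the productivity estimates reported later in the paper
# (beta2 = 1.11, beta12 = 2.46), after 1.0 unit of cumulated size:
print(learning_effect(1.11, 2.46, 0, 1.0))  # 1.11 (without framework)
print(learning_effect(1.11, 2.46, 1, 1.0))  # (1.11 + 2.46) * 1.0, i.e., about 3.57
```

The unit of cumulated size is whichever unit Table 2 assigns to L; the point chosen here is purely illustrative.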

A good measure of learning is one that documents the actual rate at which the programmer understands concepts during his work; unfortunately, however, a measurement tool of this nature is extremely difficult to find. One option is periodical tests or possibly interviews with the programmer; these are rather complicated to manage and subject to errors. Monitoring the subjective feeling of the programmer might be an option; this is less complicated to manage, but even more error-prone. If we take the example of manufacturing, the learning of a worker doing a repetitive job is measured by the reduction in the time required to do the same task. We used an approach similar to the manufacturing situation and chose as a proxy of learning the cumulative size of software written by the programmer. This has the advantage of being objective and easy to collect. While we realize that this is not the optimal choice, since in our case the task is not completely repetitive, it is definitely cost-effective.

3.3.2 Internal Validity

Internal validity discusses whether the experimental design allows the demonstration of causality between the input variables (factors and covariates) and the responses.

The design isolates one factor and considers three covariates. The primary concern focuses on the effects of the factor, the learning covariate, and their interaction; the other two covariates (size and complexity) are added merely to improve precision in the estimation of the three above effects. We wish to point out that hypotheses H1 and H3 are related to testing whether the effect of F is significant and positive (i.e., β1 > 0 in the models for productivity and quality); hypotheses H2 and H4 are related to testing whether the effect of FL is significant and positive (β12 > 0 in the models).

The number of runs (nine: four in the control group, five in the experimental one) was chosen in order to accommodate models with at most five terms, excluding the constant (see model (1)), equivalent to about one half of the total degrees of freedom. However, even more parsimonious models may be expected because the size variable S might not be statistically significant, having already been accounted for in the definition of the responses (explicitly in productivity and implicitly in the quality index). This is likely to reduce the number of active predictors in the fitted models.

Design criteria have already been discussed.

3.3.3 External Validity

External validity discusses how well the study results can be generalized.

Our study is an exploratory case study on a single subject. We needed to observe the same programmer several times in order to make inferences about learning. Having only one programmer involved did not present us with an ideal situation. On the other hand, replicating the study with more subjects was considered too expensive by the company.

. On the negative side, we have no quantitative, objective means to evaluate our programmer in comparison to a population of programmers, even though it is well known that individual personality characteristics can have huge effects on the products and process of programming. In the subjective judgment of the authors, our programmer is probably above average but is not unrepresentative of the general programmer population.

. On the positive side, the study is comparative: The same programmer is used to compare two ways of programming. In principle, there is no reason why he should favor either one of them.

Overall, we cannot claim that our results remain valid for all programmers. However, we believe that we have provided an empirical base for discussion which has been missing so far, and we hope to stimulate interested researchers to conduct confirmatory experiments or other related studies in the future, since advances in empirical research are nearly always incremental.

The study is limited to a specific programming language, application domain, and framework. The programming language employed has become increasingly popular, and one can argue that it is representative of the class of object-oriented programming languages widely used in practice. The application domain is also increasingly popular, but clearly has a number of specific characteristics that could influence results.

The size and design of the framework could also influence results. The design is as simple as possible and probably representative of a large family of frameworks. The framework is made up of 22 classes, or slightly less than 10 Kloc. We believe it can be classified as a small-to-medium framework. We believe that large frameworks (100 classes and more) have different effects on learning: The learning duration may well be longer, and the variation in productivity for a programmer after learning could be higher (more leverage from a larger framework) or lower (more difficulty in managing a larger framework) than in our case. It should be noted that, given a complex domain, the trend is to develop several related but smaller frameworks rather than a single large framework.

Finally, the reuse level does not vary considerably in the present study, which explains its exclusion from the model. We feel comfortable in stating that this does not influence results when the reuse level is very high, as in our case. When the reuse level is lower (especially below 50 percent), results may well be different. However, reusing a framework is not usually justifiable when the reuse level is so low. An issue related to reuse level is the selection of the applications. Ours were selected so as to be suitable for implementation with the framework (i.e., so that they had a high reuse level). We believe this is not a threat to the analysis and extrapolation of results since it reflects a common practice with frameworks: First, a framework is developed when it is likely that several applications can be derived from it; then the framework is used whenever it is suitable for supporting an application.

3.4 Statistical Analysis

Hereafter, we present the empirical results and the statistical analysis performed on them. The Minitab package [21] has been used for the analysis. Data referring to the independent variables and responses resulting from the study are reported in Table 4. Simple graphical displays of the two responses using dot-plots (Fig. 3) and dispersion plots versus cumulated size (Fig. 4) reveal the fundamental features of the results. First, we make a qualitative evaluation of the four hypotheses by looking at the diagrams. Later, we perform statistical tests to give formal assurance. Model (1) is also estimated for both productivity and quality.

Hypotheses H1 and H3 are evaluated by comparing, in the two development modes, the average figures of productivity (H1) and quality (H3). In both cases (see Fig. 3 and Fig. 4), the data points in the experimental subgroup are higher than those in the control subgroup. This gives us a clear graphical indication that both H1 and H3 are true.

Hypotheses H2 and H4 are evaluated by comparing, in the two development modes, the variation of productivity (H2) and quality (H4) over time (Fig. 4 reports cumulated size on the x axis, but size cumulates with time). In Fig. 4, regression lines fitting the data points are also plotted for ease of interpretation. Productivity, both with and without the framework, increases. But productivity using the framework increases more than productivity without it. This suggests that H2 is true. As far as the quality index is concerned, we can observe that there is an improvement over time in both cases, but the improvement using the framework is now inferior to the improvement without it. This suggests that H4 is false.

The analysis summarized in Fig. 4 is simple and intuitive, but limited, since it considers only cumulated size as an independent variable. A more rigorous evaluation of the hypotheses needs to consider all the input variables available in the study, as in model (1). Now, we proceed with estimating the model, i.e., its parameters β.

Model estimation is carried out in two steps. First, we identify, for both responses, the statistically significant predictors in model (1); then we fit the model, estimating only the coefficients related to the significant predictors. Identification of significant predictors is performed by Analysis of Covariance [19], a statistical procedure which permits the simultaneous testing of the effects of the factor F, the covariates S, L, C, and the interaction FL on the responses. Notice that, in our case, we could not apply the more usual Analysis of Variance. In fact, the presence of covariates makes the design nonorthogonal and, as a consequence, the contributions of all terms in model (1) to the total variation of the responses are not separable from each other. Therefore, the analysis cannot be conducted in a "one-shot" fashion as in the Analysis of Variance. Instead, Analysis of Covariance is basically a trial-and-error iterative procedure aiming at selecting the best subset of explanatory input variables; it can also be regarded as a procedure for identifying the best form of a linear predictive model for an experimental response [4]. A summary of the Analysis of Covariance procedure is reported in the appendix.

The outcome of the procedure is that, for productivity, the predictors statistically significant at the 90 percent level are (in order of significance): FL, F, C, and L. For the quality index, only F and L are significant above 90 percent. Finally, we build the models using least-squares linear regression on our data set. Note that parsimonious models will be obtained (four predictors for productivity, two for quality); this is an assurance of the goodness of the models, as we are relying upon nine experimental runs only. The model for productivity is:

p̂ = 3.28 + 1.54 F + 1.11 L − 18.4 C + 2.46 FL   (2)

with a standard error of estimate of 0.267 and an adjusted determination coefficient of 98.4 percent. Programming with the framework results in an average initial benefit in net productivity of 1.54 OOFP per hour, which grows very quickly thanks to a framework-specific learning rate which is more than double (2.46) the nonspecific one. A comparison between net productivity in the two development modes, as predicted by the model, is depicted in Fig. 5 (complexity is fixed at the sample mean, 0.17 methods per OOFP).

The regression model for the quality index is:

q̂ = 0.627 + 0.191 F + 0.0881 L   (3)

with a standard error of estimate of 0.0338 and an adjusted determination coefficient of 89.0 percent. Programming with the framework results in an improvement of the q index of nearly 0.2. Notice that the interaction FL is not included in the model because its effect is not statistically significant over the 90 percent threshold selected. In fact, FL's effect on quality is significant at about the 80 percent level (in a model including F, L, and FL). This implies that the q index rises at the same rate in the two development modes when L increases (Fig. 5), differently from what Fig. 4 suggested.
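Evaluated directly, the two fitted models quantify the gap between the modes. The snippet below simply plugs values into (2) and (3); the coefficients are the paper's estimates, while the chosen evaluation point (L = 1.0 in the units of Table 2, C at the sample mean 0.17) is our illustrative assumption:

```python
def p_hat(f: int, l: float, c: float) -> float:
    """Fitted net productivity, model (2): 3.28 + 1.54F + 1.11L - 18.4C + 2.46FL."""
    return 3.28 + 1.54 * f + 1.11 * l - 18.4 * c + 2.46 * f * l

def q_hat(f: int, l: float) -> float:
    """Fitted quality index, model (3): 0.627 + 0.191F + 0.0881L."""
    return 0.627 + 0.191 * f + 0.0881 * l

# Compare the two modes at cumulated size L = 1.0, complexity at the
# sample mean (0.17 methods per OOFP):
for f in (0, 1):
    print(f, round(p_hat(f, 1.0, 0.17), 3), round(q_hat(f, 1.0), 3))

# At this point, the productivity gap is beta1 + beta12 * L = 1.54 + 2.46
# = 4.0 OOFP/hour; the quality gap is the constant 0.191 (no FL term in (3)).
```

Note how the productivity gap grows with L through the FL term, while the quality gap stays constant, which is exactly the content of H2 (true) versus H4 (false).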

We can now analyze the hypotheses again, using models (2) and (3) and Fig. 5. The values of both productivity and quality, for a fixed value of cumulated size, are always higher when using the framework (F = 1). Therefore, we can conclude that H1 and H3 are true. To test H2 and H4, we have to look at the slopes of the lines. In the case of productivity, the slope when using the framework (F = 1) is higher, so we can conclude that H2 is true. In the case of quality, the slopes with or without the framework are the same, so H4 is false.

In practice, the discussion of the hypotheses has produced the same results, both using the plots of the data points (Fig. 4) and the statistical models (Fig. 5). Needless to say, the statistical models have a higher degree of reliability than a simple glance at the data, both because the models consider all the explanatory variables (only L is in Fig. 4) and because they are estimated after testing the statistical significance of the explanatory variables. Incidentally, this is the reason why in Fig. 4 the slopes relating to quality are different, while in Fig. 5 they are the same: The model for quality does not include the FL term because it is not statistically significant.

TABLE 4. Measures of Inputs and Outputs of the Study. Data refer to new documents and code, and do not consider documents and code in the framework.

Fig. 3. Dotplots of responses. Empty/filled circles indicate development without/with framework; numbers attached to circles refer to the execution order of experimental runs.

Recently, a critique of software defect prediction models has been published [10]. This work concludes that it is difficult to predict defect density using size and complexity metrics alone. For example, the authors argue that the time dedicated to the testing of each module is an important factor that should always be accounted for in building the model. In our case, however, this factor was considered during the design of the experiment, and it was assured that the programmer tested the applications in both modes on an equal basis.

4 CONCLUSIONS

The objective of our work was to investigate quality and productivity issues and the effect of learning in framework-based object-oriented development. Our motivation stemmed from the need to measure and understand the potential benefits that frameworks are supposed to provide. The lack of studies publishing data and evaluation results from framework-based projects was an additional stimulus.

In order to achieve our goal, we carefully designed an exploratory study using a single programmer in an industrial environment. The decision to use a single subject was dictated by constraints imposed by our industrial partner. The programmer produced five applications fully based on a framework and four applications based on traditional component development (control group). We defined metrics for productivity, quality, and learning, paying particular attention to considering net size only (i.e., only newly developed code and not reused code). Based on measurements taken on the developed applications, a statistical analysis of the data collected was performed.

We observed that productivity and quality were substantially higher in applications developed using the framework than in applications implemented through traditional development. The general expectation when using frameworks is that there will be an increase in gross productivity (i.e., productivity including both the size of new code and the size of the framework). This expectation is quite intuitive; our study confirmed its validity. Further, the study demonstrated that net productivity (i.e., productivity considering only the new code developed around the framework) is also higher. A less intuitive expectation is that implementing one unit of functionality (one OOFP in our case) around a framework is faster than implementing one unit of functionality without a framework. Again, the study confirmed the expectation's validity. This finding can be explained by considering that a framework encodes the most difficult design and coding issues of a domain or subdomain. The programmer reuses the difficult parts and writes only the remaining, easier parts. Framework users perform tasks that are more similar to parameter setting: Compare, for instance, the task of designing and coding an algorithm to that of designing and coding the data input/output functions of the same program. The latter task presents far fewer problems and "tricky points" than the former. On the other hand, the developer of a framework has to deal with the major design and coding problems and challenges. Therefore, our hypothesis is that the productivity of framework developers is below average and substantially less than that of framework users. A similar reasoning applies to quality (in our case, the effort needed to repair defects). Quality is higher when reusing a framework because the more difficult tasks have already been performed by the framework developers.

We also observed significant gains in productivity and quality due to a learning effect, i.e., the improved skill of the programmer in performing a task, due to the repetition of the task over time. In particular, productivity exhibited an impressive learning effect for framework-based applications. This finding confirms the experience of anyone who has used a framework. The leverage frameworks offer is high, but the quantity of knowledge to be digested before becoming a proficient user is huge. In other words, learning is the key, and sufficient time should be allocated for it. A possible interpretation of this lengthy learning time is that, in framework-based development, a major conceptual form of learning is involved. Borrowing from the organizational learning theory of [18], [24], we distinguish two types of learning: operational and conceptual. The former deals with speeding up repetitive operations, the latter with acquiring conceptual knowledge. As frameworks encapsulate high-level knowledge, the conceptual component of learning is more prominent than in traditional development.

Fig. 4. (a) Dispersion plots of productivity and (b) quality index as a function of the cumulated size of developed applications. Empty/filled circles indicate development without/with framework. Regression lines for the two subgroups are also displayed.

Fig. 5. Net productivity (solid lines) and quality index (dashed) versus net code size in development with and without framework. Complexity is fixed at the sample mean, 0.17 methods per OOFP. After 1.47 kOOFP, the curves extrapolate the experimental results.

The issue of when learning has finished, and the best way of measuring it (in number of applications developed? in cumulative size developed?), remains open and requires new empirical studies.
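One common way to operationalize such a measurement, sketched here as an illustration rather than as the paper's own analysis, is to fit a power-law learning curve p = a * x^b, where x is either the number of applications completed or the cumulated size developed, and b quantifies the learning rate. The data below are synthetic and noiseless, chosen only to show the fitting mechanics.

```python
# Illustrative sketch (not the study's analysis): fit a power-law
# learning curve p = a * x^b log-linearly, where x is a candidate
# learning metric (applications completed, or cumulated size).
import numpy as np

def fit_learning_curve(x, p):
    """Fit log p = log a + b log x by least squares; return (a, b)."""
    b, log_a = np.polyfit(np.log(x), np.log(p), 1)
    return np.exp(log_a), b

# Synthetic observations: productivity improving with cumulated size.
x = np.array([0.2, 0.5, 0.9, 1.4, 2.0])   # cumulated kOOFP developed
p = 4.0 * x ** 0.3                         # exact power law, for clarity

a, b = fit_learning_curve(x, p)
print(a, b)   # recovers a ≈ 4.0, b ≈ 0.3
```

Fitting the same curve against both candidate x variables and comparing goodness of fit is one way a replication study could decide which measure of learning is more appropriate.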

Also, our data showed that learning had no effect on the failure-occurrence rate in development with framework as compared to development without framework. Seemingly, this result contradicts the previous one. However, a reasonable explanation is that the quality index is bounded by 1 and, as it approaches this limit, the room for further improvement becomes smaller and smaller. As a matter of fact, quality figures for framework development are considerably closer to unity than those of traditional development, and this ceiling might limit the leverage of learning. In any case, further studies are needed to clarify the issue.

There are certain limitations that allow the above conclusions to be generalized only to some extent. The major limitation is the employment of a single subject in the study. We judged our programmer to be of slightly above-average competence, which does not allow, for example, the generalization of our findings to cases where particularly skilled subjects are employed in framework-based development. Other characteristics of our study that may limit the conclusions are the specific framework size (small to average) and the specific application domain. However, we argue that framework size may not be so prohibitive as to prevent our conclusions from being of practical use, since in complex domains a reasonable strategy would be to employ more than one small framework instead of a single large one.

Though our findings cannot be said to be definitive, they do provide a basis for discussion and indicate future research directions. As an example, the replication of our study with more than one subject would produce a sounder basis for empirically derived conclusions about framework-based development. We anticipate, however, that such an experiment will demand significantly more resources than our study had available. Experiments investigating important issues such as framework size, programming language, application domains, etc., are also needed. Finally, empirical investigation of the framework investment break-even point is necessary in order to provide managers with important decision-making tools which may assist the spread of such a promising technology. We are currently working on this issue.

APPENDIX

In Tables 5 and 6, a summary of the Analyses of Covariance for productivity and quality, including all terms (covariates S, L, C, factor F, and interaction FL), is presented. For productivity, one effect stands out, FL; one is not statistically significant, S; the others are questionable (Table 5). Pooling the nonsignificant effect into error, the new Analysis of Covariance in Table 7 shows that the significance of all remaining effects is increased; a confidence limit superior to 95 percent holds for FL, F, and C (in order of importance), and superior to 90 percent for L.

The Analysis of Covariance with all terms for the quality index (Table 6) signals the F and L effects as significant; the S and C effects appear not to be significant, while FL is doubtful. Eliminating S and C, the new analysis (Table 8) confirms that only the F and L effects are significant, with confidence limits superior to 99 and 95 percent, respectively. FL is less than 90 percent significant.

MORISIO ET AL.: QUALITY, PRODUCTIVITY, AND LEARNING IN FRAMEWORK-BASED DEVELOPMENT: AN EXPLORATORY CASE STUDY 887

TABLE 5
Analysis of Covariance Table for Response p, Productivity

All terms are considered.

TABLE 6
Analysis of Covariance Table for Response q, Quality Index

All terms are considered.

TABLE 7
Analysis of Covariance Table for Response p, Productivity

Only potentially significant terms are considered.

TABLE 8
Analysis of Covariance Table for Response q, Quality Index

Only potentially significant terms are considered.


For details on the use of Analysis of Covariance see, for instance, Mason et al. [19].

ACKNOWLEDGMENTS

The authors would like to thank Fabio Balduzzi, Corrado Moiso, and Bruno Noero for initiating the study, and Shari Lawrence Pfleeger, Forrest Shull, the anonymous referees, and Don Batory for reviewing the paper and contributing to its improvement.

REFERENCES

[1] A.J. Albrecht, "Measuring Application Development Productivity," Proc. Joint SHARE, GUIDE, and IBM Application Development Symp., Oct. 1979.

[2] G. Antoniol, K. Lokan, G. Caldiera, and R. Fiutem, "A Function Point-Like Measure for Object-Oriented Software," Empirical Software Eng., vol. 4, no. 3, pp. 263-287, 1999.

[3] V.R. Basili, R.W. Selby, and D.H. Hutchens, "Experimentation in Software Engineering," IEEE Trans. Software Eng., vol. 12, no. 7, July 1986.

[4] G.E. Box, W.G. Hunter, and J.S. Hunter, Statistics for Experimenters: An Introduction to Design, Data Analysis and Model Building. New York: John Wiley, 1978.

[5] G.E.P. Box and N.R. Draper, Empirical Model Building and Response Surfaces. New York: John Wiley and Sons, 1986.

[6] S.R. Chidamber and C.F. Kemerer, "A Metrics Suite for Object Oriented Design," IEEE Trans. Software Eng., vol. 20, no. 6, pp. 476-493, June 1994.

[7] J. Coplien, D. Hoffman, and D. Weiss, "Commonality and Variability in Software Engineering," IEEE Software, pp. 37-45, Dec. 1998.

[8] J.M. Favaro, K.R. Favaro, and P.F. Favaro, "Value Based Reuse Investment," Annals of Software Eng., vol. 5, pp. 5-52, 1998.

[9] M.E. Fayad and D.C. Schmidt, "Object-Oriented Application Frameworks," Comm. ACM, vol. 40, no. 10, pp. 32-38, Oct. 1997.

[10] N.E. Fenton and M. Neil, "A Critique of Software Defect Prediction Models," IEEE Trans. Software Eng., vol. 25, no. 5, pp. 675-689, Sept./Oct. 1999.

[11] R.A. Fisher, "The Arrangement of Field Experiments," J. Ministry of Agriculture, vol. 33, pp. 503-513, 1926.

[12] W. Harrison, "N=1, an Alternative for Software Engineering Research?" Proc. Workshop Beg, Borrow, or Steal: Using Multidisciplinary Approaches in Empirical Software Eng. Research, Int'l Conf. Software Eng., Aug. 2000.

[13] IFPUG, "Counting Practices Manual, Release 4.0," Int'l Function Point Users Group, Jan. 1994.

[14] R. Johnson and B. Foote, "Designing Reusable Classes," J. Object-Oriented Programming, vol. 1, no. 5, June 1988.

[15] C. Jones, Applied Software Measurement: Assessing Productivity and Quality. McGraw-Hill, 1991.

[16] C.M. Judd, E.R. Smith, and L.H. Kidder, Research Methods in Social Relations, sixth ed. Rinehart and Winston, 1991.

[17] Single Case Research Design and Analysis, T. Kratochwill and J. Levin, eds. Lawrence Erlbaum Associates, 1992.

[18] J.G. March, "Exploration and Exploitation in Organizational Learning," Organizational Learning, M.D. Cohen and L.S. Sproull, eds., 1996.

[19] R.L. Mason, R.F. Gunst, and J.L. Hess, Statistical Design and Analysis of Experiments. New York: John Wiley and Sons, 1989.

[20] M. Mattsson, "Effort Distribution in a Six Year Industrial Application Framework Project," Proc. IEEE Int'l Conf. Software Maintenance (ICSM '99), pp. 326-333, 1999.

[21] MINITAB for Windows, Minitab Inc., 2000.

[22] M. Morisio, D. Romano, I. Stamelos, and B. Spahos, "Measuring Functionality and Productivity in Web-Based Applications: A Case Study," Proc. Sixth IEEE Int'l Symp. Software Metrics, 1999.

[23] S. Moser and O. Nierstrasz, "The Effect of Object-Oriented Frameworks on Developer Productivity," Computer, pp. 45-51, Sept. 1996.

[24] I. Nonaka, "A Dynamic Theory of Organizational Knowledge Creation," Organization Science, vol. 5, no. 1, pp. 14-30, 1994.

[25] A.A. Porter, L.G. Votta, and V.R. Basili, "Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment," IEEE Trans. Software Eng., vol. 21, no. 6, pp. 563-575, June 1995.

[26] A.A. Porter and P.M. Johnson, "Assessing Software Review Meetings: Results of a Comparative Analysis of Two Experimental Studies," IEEE Trans. Software Eng., vol. 23, no. 3, pp. 129-145, Mar. 1997.

[27] J.S. Poulin, Measuring Software Reuse: Principles, Practices and Economic Models. Addison-Wesley, 1997.

[28] J.S. Poulin, "The Economics of Software Product Lines," Int'l J. Applied Software Technology, vol. 3, no. 1, pp. 20-34, Mar. 1997.

[29] L. Prechelt and W. Tichy, "A Controlled Experiment to Assess the Benefits of Procedure Argument Type Checking," IEEE Trans. Software Eng., vol. 24, no. 4, pp. 302-312, Apr. 1998.

[30] F. Shull, F. Lanubile, and V.R. Basili, "Investigating Reading Techniques for Object-Oriented Framework Learning," IEEE Trans. Software Eng., vol. 26, no. 11, Nov. 2000.

[31] S. Srinivasan, "Design Patterns in Object-Oriented Frameworks," Computer, Feb. 1999.

[32] C. Wohlin, P. Runeson, M. Host, and M.C. Ohlsson, Experimentation in Software Engineering: An Introduction. Kluwer, 2000.

Maurizio Morisio received the PhD degree in software engineering and the MSc degree in electronic engineering from Politecnico di Torino. He is a research assistant in the Dipartimento di Automatica e Informatica, Politecnico di Torino, Turin, Italy. He recently spent two years working with the Experimental Software Engineering Group at the University of Maryland, College Park. During that time, he was the codirector of the Software Engineering Laboratory (SEL), a consortium of NASA Goddard Space Flight Center, the University of Maryland, and Computer Sciences Corporation, which has the mission of improving software practices at NASA and CSC. His research and consulting are aimed at understanding how software is produced and maintained, in order to improve software processes and products in industrial settings. His current focus is on open source development and service engineering for the wireless internet. He is a member of the IEEE Computer Society.

Daniele Romano received a degree in electronic engineering in 1990 at the Politecnico di Torino, where he was appointed assistant professor at the Department of Production Systems and Economics in 1994. Since 2000, he has been an associate professor at the University of Cagliari, where he lectures in the quality management and computer-aided production systems courses. His current research activity is in the field of industrial statistics and focuses on the study and improvement of industrial processes and products. He is an expert on design of experiments techniques, using both physical and numerical experiments. He is the author of more than 50 scientific papers and is a member of the American Society for Quality (ASQ) and a founding member of the European Network for Business and Industrial Statistics (ENBIS).

Ioannis Stamelos received the BSc degree in electrical engineering from the Polytechnic School of Thessaloniki and the PhD degree in computer science from the Aristotle University of Thessaloniki (1988). He has been a lecturer of computer science at the Department of Informatics, Aristotle University of Thessaloniki, Greece, since 1997. Before joining academia, he worked as a senior researcher at Telecom Italia (1988-1994) and as systems integration director at STET Hellas, a mobile telecom operator (1995-1996). He teaches courses on language theory, object orientation, software engineering, and information systems. His research interests include evaluation, cost estimation, and management in the areas of software development and information systems. He is a member of the IEEE and the IEEE Computer Society.

