

Empirical evaluation of an educational game on software measurement

Christiane Gresse von Wangenheim · Marcello Thiry · Djone Kochanski

C. Gresse von Wangenheim (*) · M. Thiry · D. Kochanski
Universidade do Vale do Itajaí (UNIVALI), Master Program on Applied Computer Science, São José, SC, Brazil
e-mail: [email protected]

M. Thiry
e-mail: [email protected]

D. Kochanski
e-mail: [email protected]

C. Gresse von Wangenheim
Universidade Federal de Santa Catarina (UFSC), Graduate Program in Computer Science, Florianópolis, SC, Brazil

Published online: 1 October 2008
© Springer Science + Business Media, LLC 2008
Editor: Forrest Shull

Empir Software Eng (2009) 14:418–452
DOI 10.1007/s10664-008-9092-6

Abstract Software measurement is considered important in improving the software process. However, teaching software measurement remains a challenging issue. Although games and simulations are regarded as powerful tools for learning, their learning effectiveness is not rigorously established. This paper describes the results of an explorative study that investigates the learning effectiveness of a game prototype on software measurement, in order to make an initial judgment about its potential as an educational tool, as well as to analyze its appropriateness, engagement, and strengths and weaknesses as guidance for further evolution. Within the study, a series of experiments was conducted in parallel in three master courses in Brazil. The results of the study reveal that the participants consider the content and structure of the game appropriate, but no indication of a significant difference in learning effectiveness could be shown.

Keywords Measurement · Educational game · Project management · Experiment

1 Introduction

Although there have been significant advances in awareness of the potential benefits of software measurement, the software industry still remains slow in establishing measurement programs and, often, measurement initiatives continue to fail (Dekkers and McQuaid 2002; Kasunic 2006). One of the assumed reasons is a lack of education (Löper and Zehle 2003; Hock and Hui 2004), as many computer science courses still do not cover software measurement as part of their curriculum. At most, students are taught a minimum of basic knowledge on software measurement as part of an undergraduate or graduate software engineering lecture (Hock and Hui 2004; Ott 2005). In addition, many lectures do not stress the importance of software measurement in practice and, consequently, fail to motivate the students sufficiently, as measurement is often perceived as a complex and difficult subject (Hock and Hui 2004).

One of the reasons for this problem is the way in which software measurement is taught. Expository lessons are still the dominant instructional technique in basically all sectors of education and training (Percival et al. 1993). While they are adequate for presenting abstract concepts and factual information, they are not the most suitable for higher-cognitive objectives aiming at the transfer of knowledge to real-life situations (Choi and Hannafin 1995). Another disadvantage is that such education is not only cost-intensive, but also time-consuming (Bruns and Gajewski 1999). And, as measurement is only one area of software engineering, there is typically not sufficient time in software engineering lectures to provide students with a solid understanding, as well as to teach them the application of measurement in practice. Such practical constraints usually limit the exposure of students to realistic measurement programs, especially as practical exercises typically require close supervision by instructors and a significant amount of time. Therefore, it remains a challenge to teach students in a compact but interesting way, so that they understand the key concepts and are capable of applying measurement in real-world situations.

In this context, educational games have become an alternative providing various advantages (Percival et al. 1993). They can offer the virtual equivalent of a real-world experience. Thus, they can be effective in reinforcing the teaching of basic concepts by demonstrating their application and relevance, as well as in developing higher-cognitive competencies by providing illustrative case studies (Percival et al. 1993). In particular, computer-based games can allow “learning by doing” in realistic situations with the immediate provision of feedback. This can also leave students more confident in their ability to handle a similar situation in real life. Another advantage is that learners can work at their own pace, without requiring the presence of an instructor or interaction with others. And, building on the engaging nature of games, they can make learning more fun, if not easier (Kafai 2001).

In this context, we are developing a computer-based educational game on software measurement with the objective of reinforcing the remembering and understanding of basic concepts and of training measurement application. Yet, although computer games and simulations are regarded as powerful tools for learning, due to a lack of well-designed studies about their integration into teaching and learning, their success is questionable or at least not rigorously established (Akili 2007). Thus, our goal is to investigate the effectiveness of the game in order to explore its potential as an educational tool and to guide its evolution. Therefore, this paper describes a series of experiments with a first prototype of the game, which we ran as part of a software measurement module in parallel in three graduate software engineering lectures in Brazil.

The paper is structured as follows. Section 2 provides an overview of the educational game X-MED. Related work in this area is presented in Section 3. Section 4 describes the experiment. Data collection and preparation are presented in Section 5 (the complete data set is included in the Appendix). Section 6 presents the data analysis, which is summarized and discussed in Section 7. Conclusions are presented in Section 8.


2 Educational Game X-MED

Our research interest is to develop a computer-based educational game on the definition and execution of a software measurement program in a hypothetical real-life scenario. A first step in this direction is X-MED v1.0 (Lino 2007), a computer-based educational game prototype on software measurement. The objective of the game is to exercise the application of software measurement in the context of project management in alignment with maturity level 2 of the CMMI-DEV v1.2 (CMMI 2006), based on GQM—Goal/Question/Metric (Basili et al. 1994) and including elements from PSM—Practical Software and Systems Measurement (McGarry et al. 2001).

The instructional design of the game is being developed in accordance with the education-learning process proposed by Percival et al. (1993). The learning objective of the game is to reinforce measurement concepts and to teach the competency to apply the acquired knowledge, covering the cognitive levels remembering, understanding and applying in accordance with the revised version of Bloom's taxonomy of educational objectives (Anderson and Krathwohl 2001) (see Fig. 1).

By selecting adequate solutions from a set of pre-defined alternatives in order to solve practical problems, students are supposed to learn how to develop or select adequate measurement goals, GQM plans and data collection plans, and how to verify, analyze and interpret data. In accordance with those learning objectives, the game has been designed as a single-player environment, in which the learner takes the role of a measurement analyst and defines and executes, step by step, a measurement program in a realistic scenario.

The game consists of three main phases: the introduction to the game, the main game play and the finalization. The main part of the game X-MED v1.0 follows a linear flow of actions in accordance with the GQM measurement process (as shown in Table 1). During the game, the learner executes sequentially each of the steps of a measurement program, for example, step 1—context characterization, step 2—identification of measurement goal, step 3.1—definition of abstraction sheet, and so on. Each of the steps follows the same sequence: first presenting the task description, then providing information on the task, and then requesting the player to select the most adequate solution from a set of alternatives. Based on the selected alternative, a pre-defined score and feedback are presented.
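Conceptually, this step structure can be modeled as follows. This is a minimal sketch for illustration only; the actual prototype was implemented in Java, and all names here are ours, not taken from X-MED:

```python
from dataclasses import dataclass

@dataclass
class Alternative:
    text: str      # candidate solution shown to the player
    score: float   # pre-defined partial score awarded for this choice
    feedback: str  # pre-defined feedback shown after the choice

@dataclass
class Step:
    task: str                        # task description, e.g. "Identify the measurement goal"
    material: list[str]              # supporting material (project plan excerpt, interview records, ...)
    alternatives: list[Alternative]  # the six pre-defined candidate solutions
    correct_index: int               # index of the pre-defined correct alternative

def play_step(step: Step, chosen: int) -> tuple[float, str]:
    """Return the pre-defined score and feedback for the chosen alternative."""
    alt = step.alternatives[chosen]
    return alt.score, alt.feedback
```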

In each step, a task is presented to the learner. For example, in step 2, the learner is asked to identify the most appropriate measurement goal for the given situation (Fig. 2).

Then, the game presents information and material for the respective task (e.g., product description, project plan excerpt, interview records, etc.).

Fig. 1 Revised version of Bloom's taxonomy of educational objectives (Anderson and Krathwohl 2001):

1. Remembering: Retrieving, recalling, or recognizing knowledge from memory, as when memory is used to produce definitions, facts, or lists, or to recite or retrieve material.
2. Understanding: Constructing meaning from different types of functions, like interpreting, exemplifying, classifying, summarizing, inferring, comparing, and explaining.
3. Applying: Carrying out or using a procedure through executing or implementing.
4. Analyzing: Breaking concepts into parts, determining how the parts relate or interrelate to one another or to an overall structure or purpose, including differentiating, organizing, and attributing.
5. Evaluating: Making judgments based on criteria and standards through checking and critiquing.
6. Creating: Putting elements together to form a coherent or functional whole; reorganizing elements into a new pattern or structure through generating, planning, or producing.


For example, in step 2, this includes a textual recording of a brainstorming meeting on the identification of measurement goals in the hypothetical software organization (see Fig. 3), along with additional background information on the organization, its product and its process.

Once the presented material has been analyzed by the learner, the game asks the learner to make a decision with respect to the presented task. In order to control the complexity and variability of such decisions in the measurement domain, in this initial version of X-MED, decisions are made by selecting the most adequate solution from a pre-defined set of six alternatives. For example, in step 2, the task is to select the most adequate measurement goal in the given situation from six potentially relevant measurement goals (see Fig. 4).

Once an alternative has been selected, the game immediately provides pre-defined feedback and a score to the student, depending on the selected alternative (Fig. 5).

As another example, Fig. 6 shows the game flow with respect to step 7 (data interpretation). Again, the task is presented first. Here the player has to select an adequate interpretation based on the collected data and the recorded feedback session. Six alternative interpretations are presented and, once one is selected, the game presents the respective feedback.

Table 1 Overview of the flow of actions

Phase 1—Game introduction: General explanation of the game's objective and how it works.

Phase 2—Game execution, including the following steps (each step follows the sequence: task description; presentation of material; selection of the task solution from 6 alternatives; score and feedback):
  Step 1—Context characterization: Analysis of the context information on the hypothetical software organization, its product, projects and software process.
  Step 2—Identification of measurement goal: Identification of the most adequate measurement goal in the given situation.
  Step 3—Development of GQM plan, including:
    Step 3.1—Definition of abstraction sheet: Definition of an abstraction sheet for one measurement goal.
    Step 3.2—Identification of GQM questions: Identification of two GQM questions based on the abstraction sheet.
    Step 3.3—Definition of analysis models: Definition of analysis models with respect to two GQM questions.
    Step 3.4—Specification of measures: Specification of measures with respect to two analysis models.
  Step 4—Development of data collection plan: Definition of data collection procedures with respect to two measures.
  Step 5—Data verification: Verification of a set of collected data.
  Step 6—Data analysis: Identification of adequate data analysis results with respect to one GQM question.
  Step 7—Data interpretation: Identification of adequate data interpretation results with respect to one GQM question.

Phase 3—Game finalization: Presentation of the total score and a summary of the feedback.


Independent of which alternative was selected by the learner, the game advances to the next step, always using the pre-defined correct decision of the previous step as a basis. This means that, if a learner makes a wrong decision in a step, the game automatically uses the correct answer to continue the game.

In the end, a total score is calculated as the sum of the partial scores, and a final report with all partial scores and feedback is generated.
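Reusing the hypothetical Step and play_step from the sketch above, the overall game loop and scoring could look like this (again, an illustrative sketch, not the actual implementation):

```python
def run_game(steps: list[Step], choices: list[int]) -> tuple[float, list[str]]:
    """Play all steps in their fixed, linear order.

    Because every step's material is pre-authored on top of the correct
    decisions of the previous steps, a wrong choice affects only the score
    and feedback; the game still advances as if the correct answer was given.
    """
    total, report = 0.0, []
    for step, chosen in zip(steps, choices):
        score, feedback = play_step(step, chosen)
        total += score           # the total score is the sum of the partial scores
        report.append(feedback)  # the final report collects all partial feedback
    return total, report
```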

The target audience of the game is graduate students in computer science courses and software engineering professionals. The game is designed as a complement to traditional classroom or e-learning courses, providing an environment to exercise the presented concepts.

Fig. 2 Screenshot illustrating the task description for the identification of the measurement goal (showing the game sequence, the task description and access to the context information)

Fig. 3 Screenshot illustrating material for the identification of the measurement goal (the record of the meeting for the identification of the measurement goal, with access to context information)


It requires a basic understanding of software engineering, software measurement, project management and the CMMI framework. The game is intended to be used individually by a student, without the need for interaction with other students or an instructor. The average duration of a game session is about 2 h.

We developed the game using evolutionary prototyping based on the ADDIE model—Analysis, Design, Development, Implementation, and Evaluation (Molenda et al. 1996). The material (including alternatives, scores and feedback) has been prepared based on measurement literature, reported experiences and our own experience in applying measurement in practice. So far, X-MED v1.0 consists of only one static scenario, focusing on the application of software measurement for project management. In its current version, it is not configurable or customizable to other measurement approaches or scenarios.

Fig. 4 Screenshot illustrating the selection of the most adequate measurement goal (asking the learner to select the most adequate measurement goal for the given situation from the measurement goal alternatives)

Fig. 5 Screenshot showing detailed feedback


The prototype has been implemented with Java JDK 6 as a desktop system. All three phases and the seven steps summarized in Table 1 have been implemented. Only textual and graphical elements are used; no multimedia elements (sound, video) were integrated. The game has been implemented in Brazilian Portuguese. A demo version of the prototype X-MED v1.0 is available at: http://www.incremental.com.br/xmed.

In this respect, X-MED v1.0 represents a simplified and limited version of an educational game. Although there are many perspectives on games for learning (Michael and Chen 2006; Abt 2002; Prensky 2001), in accordance with common elements such as play, rules and competition, X-MED v1.0 can be regarded as an educational game, albeit a simple one, as it provides a contest (to achieve the maximum score) in which the player operates under rules in accordance with measurement theory to attain a specified objective (to adequately define and execute parts of a hypothetical measurement program) based upon learning.

Based on Ellington et al. (1982), X-MED v1.0 can be regarded as a “game used as case study”, as it provides a game in which the learner has to examine a realistic scenario, where s/he takes the role of a measurement analyst in a hypothetical software organization. In this first version of the game, the narrative and flow of events are strictly linear, based on a series of constrained selections with pre-defined alternatives, following the procedural steps of a GQM-based measurement program. Yet, in accordance with Greitzer et al. (2007), we consider even such a relatively simple learning experience a valuable opportunity for learning, through its problem-centered focus on content-related decisions.

Fig. 6 Screenshots demonstrating the game flow of step 7 (data interpretation): (1) task description, (2) record of the feedback session, (3) interpretation alternatives, (4) feedback


3 Related Work

The idea of adopting educational games for software engineering education is recent and, so far, no games exist for teaching software measurement. However, several games are available in other software engineering areas, mainly project management (including, e.g., The Incredible Manager (Dantas et al. 2004), Project-o-poly (Buglione 2007) and management training courses (Collofello 2000)) or simulation games for the execution of software processes, such as SimSE (Oh Navarro and van der Hoek 2007), SESAM (Drappa and Ludewig 2000), SimVBSE (Jain and Boehm 2006), OSS (Sharp and Hall 2000) and Problems and Programmers (Baker et al. 2003), among others. Several of those games or teaching methods have also been evaluated with respect to their impact on software engineering education.

One of the most comprehensive evaluations in this context has been done on the educational game SimSE (Oh Navarro and van der Hoek 2007) for the creation and simulation of software process models. As part of the research on the game, a multi-angled evaluation, including an initial pilot study, an in-class study, a comparative study (designed as a formal experiment) and an observational study, has been run to provide a comprehensive picture of SimSE's overall effectiveness and a general understanding of its strengths and weaknesses.

Another distinctive study is the externally replicated controlled experiment on the learning effectiveness of using a process simulation model for educating computer science students in software project management (Pfahl et al. 2003). In the experiments, a pretest–posttest control group design was used, where the experimental group applied a system dynamics simulation model and the control group used the COCOMO model as a predictive tool for project planning. The learning effectiveness was analyzed based on the scores in the tests and subjective improvement suggestions.

SESAM—Software Engineering Simulation by Animated Models (Drappa and Ludewig 2000)—has been evaluated through a case study and a controlled experiment in order to investigate whether simulation based on the SESAM model helps to improve project management education. Both studies used a pretest–posttest design comparing the performance of the participants in terms of a questionnaire score, the preparation of a project plan and the results of the simulation runs.

Yet, other evaluations of educational games generally take into consideration only the level of reactions of the participants. For example, the usefulness of the project management game The Incredible Manager (Dantas et al. 2004) within a training concept has been analyzed through two experimental studies. Within the studies, only subjective factors, such as fun and interest, as well as the identification of the game's limitations and drawbacks, have been investigated. Another example is a large study, involving more than 1,500 participants, on the investigation of a case study element of a course presented through an innovative interactive multimedia simulation of a software house, Open Software Solutions (Sharp and Hall 2000). In this simulation, the student plays the role of an employee and performs various tasks as a member of the company's project teams. The evaluation covers usability aspects, such as attractiveness, learnability and helpfulness, as well as positive and negative factors of the simulation.

Another example is the incorporation of empirical studies in three industry short courses on the effects of test-driven development (TDD) on internal software quality (Janzen et al. 2007). The experiments compared an iterative test-first approach with an iterative test-last approach by analyzing various software metrics on software size, complexity, coupling, cohesion, and testing.


Additional data on the programmers' opinions of TDD was collected via pre- and post-experiment surveys.

In contrast, experiments regarding software measurement education are extremely rare. One example is a practical experiment in teaching software metrics (Thomas 1996). However, this experiment aimed at introducing productivity measurement as an integral part of student software engineering projects and work assignments, rather than at evaluating any specific teaching method.

4 Experiment

4.1 Research Objectives

Our motivation for the study is that, although educational games are recognized as an interesting teaching means, it remains open to what degree they indeed contribute to learning (Akili 2007). Thus, as explorative research, this series of experiments aims at providing a first insight into the overall effectiveness of the prototype of the game X-MED v1.0 as an educational tool. The research goal is to evaluate these aspects from the viewpoint of the researchers in the context of graduate software engineering lectures.

Our hypothesis is that the usage of the educational game X-MED has a positive learning effect on the capability of learners to define and execute measurement programs for project management in alignment with maturity level 2 (ML2) of the CMMI-DEV. We expect a positive reinforcement effect regarding the remembering and understanding of measurement concepts and a positive learning effect on the capability to apply the acquired knowledge.

A second objective of this study is to evaluate the appropriateness of the game, in terms of its content, teaching method and duration, as well as its engagement, from the viewpoint of the learners. We also want to obtain first feedback on the strengths and weaknesses of the prototype of the game in order to guide its evolution.

4.2 Context

We ran a series of experiments within a module on software measurement as part of software engineering lectures in master courses on computer science. Within each experiment, we followed the same syllabus:

4.2.1 Syllabus of the Module “Software Measurement”

Context This module is planned to be part of a master course lecture or part of a professional training related to software engineering, software quality or software process improvement.

Pre-requisites Students are expected to have at least a bachelor degree in computer science (or a related area) and a basic understanding of software engineering, project management and the CMMI framework.

Module Objective The objective of this module is to provide a basic understanding of the definition and execution of software measurement programs for project management in alignment with maturity level 2 of the CMMI-DEV.


Module Description Measurement concepts and terminology; overview of the standard ISO/IEC 15939 (measurement information model and measurement process model); overview of measurement methods (GQM, PSM); measurement in reference models (CMMI-DEV); the measurement process step by step: context characterization, definition of measurement goals, development of the GQM plan, development of the data collection plan, data collection, verification and storage, data analysis and interpretation, communication of measurement data and results.

Learning Outcomes As a result of this module, the student should have a basic understanding of software measurement and should be capable of defining and executing basic measurement programs for project management under supervision. The learning objective of the software measurement module of these university courses is directed at cognitive learning, including declarative and procedural knowledge on the remembering, understanding and applying levels in accordance with the revised version of Bloom's taxonomy (Anderson and Krathwohl 2001). Upon completion of this module, students will have the ability to:

Remembering: Can students RECALL information?
  ▪ Recall measurement concepts and terminology: measurement goal, measure, etc.
  ▪ Cite relevant standards, models and methods.
  ▪ Name the steps of the measurement process.

Understanding: Can students EXPLAIN ideas?
  ▪ Classify and illustrate measurement elements, such as measures.
  ▪ Describe the measurement process.

Applying: Can students USE a procedure?
  ▪ Construct measurement programs focusing on ML2 of the CMMI.
  ▪ Select adequate measurement elements in the definition and execution of measurement programs focusing on ML2 of the CMMI.

Teaching Methods Different teaching methods are used, including a 4-h expository lecture with in-class exercises and a 2-h application of the educational game X-MED. The main focus of the different teaching methods with respect to the intended knowledge levels is shown below:

Teaching method      Remembering   Understanding   Applying
Lecture              x
In-class exercises   x             x
Game                               x               x

Assessment The achievement of the expected learning outcomes is assessed by written tests with multiple-choice and open questions. The tests include questions on all three knowledge levels (remembering, understanding and applying).


Regarding application, the tests include questions on the application of measurement with the same focus as in the game (monitoring project schedule, effort and cost), as well as transfer questions regarding the application of measurement for monitoring other process areas associated with ML2 of the CMMI (software configuration management, software acquisition management, etc.).

4.3 Experiment Design

We ran a series of three experiments in parallel, without any modifications, in each of the lectures. In each of the experiments, we applied a classic experimental design (randomized pre-test post-test control group design) (Takona 2002; Wohlin et al. 2000). In order to minimize the problem of selection bias, the distribution of participants to the experimental or control group was randomized in a balanced manner (a sketch of such an assignment is shown at the end of this subsection).

In each of the experiments, the groups were equally trained with respect to basic measurement concepts and the measurement process through an expository lecture and in-class exercises given by the same instructor. Then, both groups in each experiment took a pre-test prior to the application of the treatment. Afterwards, the treatment—the usage of the game—was given only to the experimental group. No treatment was applied to the control group. In the end, both groups in each experiment took the post-test. The design of each of the experiments is as follows:

Group             Assignment          Training             Pre-test   Treatment        Post-test
(A) Experimental  Random assignment   Lecture & exercise   Test 1     Game             Test 2
(B) Control       Random assignment   Lecture & exercise   Test 1     (no treatment)   Test 2

The objective of this study was to analyze whether the game (as a complement to traditional lectures and exercises) has any impact on the learning outcome. Therefore, we applied the game (as treatment) to the experimental group only. As we did not intend to compare the game to any other teaching means (e.g., to evaluate whether its impact is greater than that of working on case studies), we did not apply any treatment to the control group.
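For illustration, such a balanced random assignment can be sketched as follows; the function name and seed are ours, purely hypothetical, not the authors' actual procedure:

```python
import random

def assign_balanced(participants: list[str], seed: int = 1) -> tuple[list[str], list[str]]:
    """Randomly split participants into experimental (A) and control (B)
    groups of (near-)equal size."""
    pool = participants[:]           # copy, so the input list is untouched
    random.Random(seed).shuffle(pool)
    half = (len(pool) + 1) // 2      # with 15 participants: 8 vs. 7
    return pool[:half], pool[half:]  # group A, group B
```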

4.4 Hypotheses and Variables

Our research questions are:

Research Question 1 Is the learning effect on the remembering, understanding and applying levels in the experimental group A higher than in the control group B?

Research Question 2 Is the educational game considered appropriate in terms of content relevancy, correctness, sufficiency, degree of difficulty, sequence, teaching method and duration in the context for which it is intended? Is the game considered engaging? What are its strengths and weaknesses?

The objective of this research question is to obtain a subjective evaluation of these aspects from the learners' point of view, rather than to formally evaluate, e.g., the correctness or completeness of the game in accordance with measurement theory. We selected these variables in order to help identify any flaws in the game design, e.g., missing elements which the learners consider important to include. Yet, students who are learning about software measurement might not be able to provide valuable feedback on those issues and/or might be biased.


Due to those threats to validity, we do not analyze those issues solely based on the students' evaluations. For a comprehensive analysis of the correctness, relevancy and completeness of the game, reviews were performed by software engineering experts during the development of the game.

In order to analyze these research questions, we adopt Kirkpatrick's four-level model for evaluation (Kirkpatrick and Kirkpatrick 2006), a popular and widely used model for the evaluation of training and learning. This model essentially represents four successive levels of increasingly precise measures of the effectiveness of training programs, as presented in Table 2.

In accordance with Kirkpatrick's four-level model for evaluation, we investigate both research questions on level one (reaction), which focuses on how the participants feel about the learning experience, by collecting data via satisfaction questionnaires. On level 1, regarding research question 1, we subjectively evaluate the perceived level of measurement competence of the participants (variable Y.1 Measurement competency) between pre- and post-test on an ordinal 6-point scale.

We investigate research question 1 also on level two (learning), which focuses on the evaluation of the increase in knowledge, by administering a pre- and a post-test. On level 2, we evaluate the learning effect separately for each of the knowledge levels (Y.2 Measurement knowledge on the remembering level, Y.3 Measurement knowledge on the understanding level, Y.4 Measurement knowledge on the applying level) by comparing the average scores between pre-test and post-test (relative learning effect) and with regard to post-test performance (absolute learning effect).

These expectations are formulated as follows (based on Pfahl et al. 2003):

(a) Relative learning effect: $(Y.i;A)_{diff} > (Y.i;B)_{diff}$, for $i = 1, \ldots, 4$
(b) Absolute learning effect: $(Y.i;A)_{post} > (Y.i;B)_{post}$, for $i = 1, \ldots, 4$

with:

$(Y.1;X)_{pre}$: value of variable Y.1 during the pre-test of participants in group X (X = A or B).
$(Y.i;X)_{pre}$: average score on the questions on variable Y.i ($i = 2, \ldots, 4$) during the pre-test of participants in group X (X = A or B).

Table 2 Overview of Kirkpatrick's four-level model for evaluation (Kirkpatrick and Kirkpatrick 2006)

Level 1—Reaction: Evaluates how the students felt about the training or learning experience. Example tools and methods: happy-sheets; feedback forms; verbal reactions; post-training surveys.

Level 2—Learning: Evaluates the increase in knowledge or capability (before and after). Example tools and methods: assessments and tests before and after the training; interviews or observation.

Level 3—Behavior: Evaluates the extent of applied learning back on the job (implementation). Example tools and methods: observation and interviews over time to assess change, the relevance of change and the sustainability of change.

Level 4—Results: Evaluates the effect on the business or environment by the trainee. Example tools and methods: long-term post-training surveys; observation as part of ongoing, sequenced training and coaching over a period of time; measures such as re-work and errors to assess whether participants achieved the training objectives; interviews with trainees and their managers, or their customer groups.


$(Y.1;X)_{post}$: value of variable Y.1 during the post-test of participants in group X (X = A or B).
$(Y.i;X)_{post}$: average score on the questions on variable Y.i ($i = 2, \ldots, 4$) during the post-test of participants in group X (X = A or B).

$(Y.i;X)_{diff} = (Y.i;X)_{post} - (Y.i;X)_{pre}$
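As a small numeric illustration of these definitions (the scores below are invented for this sketch, not taken from the study):

```python
def learning_effects(pre: list[float], post: list[float]) -> tuple[list[float], list[float]]:
    """Per-participant relative effect (post - pre) and absolute effect (post)."""
    diff = [round(p_post - p_pre, 2) for p_pre, p_post in zip(pre, post)]
    return diff, post

# Hypothetical test scores in [0, 1] for one group and one variable Y.i:
pre = [0.50, 0.40, 0.75]
post = [0.60, 0.55, 0.70]
diff, absolute = learning_effects(pre, post)
print(diff)      # [0.1, 0.15, -0.05] -> relative learning effect per participant
print(absolute)  # the post-test scores -> absolute learning effect
```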

The related null hypotheses are stated as follows:

H01a There is no significant difference in relative learning effectiveness between group A (experimental group) and group B (control group).

H01b There is no significant difference in absolute learning effectiveness between group A and group B.

On level 1, we also subjectively evaluate the perceived learning effect of the game (Y.5 Subjective learning effect), considering Y.5.1 Learning effect on measurement concepts and process and Y.5.2 Learning effect on measurement application, each on a four-point ordinal scale.

Regarding research question 2, we subjectively evaluate the perceived level of appropriateness of the game (Y.6 Appropriateness) by asking the participants to evaluate seven dimensions: Y.6.1 content relevancy, Y.6.2 correctness, Y.6.3 sufficiency, Y.6.4 difficulty, Y.6.5 sequence, Y.6.6 teaching method and Y.6.7 duration, each on a four-point ordinal scale. We regard the game as appropriate from the viewpoint of the learners if all participants evaluate all dimensions at least as good. Specifically with respect to correctness, content relevancy and sufficiency, the objective here is to identify any shortcomings from the viewpoint of the learners. We also evaluate, on level 1, the engagement of the game (Y.7 Engagement) by subjectively evaluating whether the participants liked the game (Y.7.1 Satisfaction) and had fun (Y.7.2 Fun), each on a four-point ordinal scale. We regard the game as engaging if all participants at least liked the game and had fun while playing.

Table 3 summarizes the research questions, instruments and variables with respect to the evaluation levels.

Within the context of the study, we considered the following disturbing factors:

DF.1 Personal background in terms of academic formation, training and practical experience.

DF.2 Motivation, based on the perceived importance of measurement and the participant's interest in learning more about measurement, each on a four-point ordinal scale.

DF.3 Additional study time spent besides the lecture, in-class exercises and game application, on a 6-point ordinal scale.

Further information on the variables can also be found in Table 6.
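DF.1 is computed as a simple count of background items present (the full item list is given in Table 6). A minimal sketch, with hypothetical dictionary keys of our own choosing:

```python
BACKGROUND_ITEMS = [
    "bsc_cs_or_related", "specialization_cs_or_related",
    "certification_software_engineering",
    "training_measurement", "training_quality", "training_cmmi",
    "training_mpsbr", "training_project_mgmt",
    "exp_developer", "exp_project_manager", "exp_sepg",
    "exp_quality", "exp_consultant", "exp_professor",
    "applied_measurement_in_practice",
]

def background_score(profile: dict[str, bool]) -> int:
    """DF.1: add a count of 1 for each background item the participant has."""
    return sum(1 for item in BACKGROUND_ITEMS if profile.get(item, False))
```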

4.5 Execution

4.5.1 Participants

We ran the series of experiments in the context of lectures in computer science master courses in Brazil. In order to obtain a larger number of participants, we performed the same experiment in parallel as part of the following lectures:

▪ “Software Process Improvement” during the 3rd trimester of 2007 in the Graduate Program in Computer Science at the Federal University of Santa Catarina/Florianópolis.


▪ “Software Engineering” during the 2nd semester of 2007 in the Master Program in Applied Computer Science at UNIVALI—Universidade do Vale do Itajaí/São José.

▪ “Software Quality and Productivity” during the 2nd semester of 2007 in the Master Program in Applied Computer Science at UNIVALI—Universidade do Vale do Itajaí/São José.

In total, 15 students participated in the series of experiments, and all participants completed the experiment. As part of their participation, the students earned educational credits. Table 4 summarizes the personal characteristics of each of the student groups.

4.5.2 Procedure and Materials

The experiments were conducted following the schedule presented in Table 5, as part of the regular classes, in weekly intervals.

After a short presentation of the teaching plan, the experiment's purpose and general organizational issues, the participants were asked to sign a consent form and to answer a questionnaire on their personal background. The questionnaire was composed of questions on their level of formation, professional experience, present position and research focus within the master program. The students also answered questions on their knowledge of and previous training in software measurement, and on their motivation to learn about measurement and to participate in the experiment. Then, an expository lecture with small in-class exercises was given. The exercises included a crossword puzzle on measurement concepts and a continuous exercise on the identification of measurement objectives, the development of an abstraction sheet and the definition of data collection procedures. The lecture was held by the same instructor in all three experiments. Afterwards, feedback on the lecture and the exercises was collected through a questionnaire.

At the second encounter, the pre-test was conducted to establish a baseline for the assessment of the learning effect.

Table 3 Evaluation overview

Research question 1: Is the learning effect on the remembering, understanding and applying level in the experimental group A higher than in control group B?
  Level 1 (satisfaction questionnaire): Y.1 Measurement competency; Y.5 Subjective learning effect
  Level 2 (pre-/post-test): Y.2 Measurement knowledge on the remembering level; Y.3 Measurement knowledge on the understanding level; Y.4 Measurement knowledge on the applying level

Research question 2: Is the educational game considered appropriate in terms of content relevancy, correctness, sufficiency and degree of difficulty, sequence, teaching method and duration in the context for which it is intended? Is the game considered engaging? What are its strengths and weaknesses?
  Level 1 (satisfaction questionnaire): Y.6 Appropriateness; Y.7 Engagement
  Level 2 (pre-/post-test): –


The pre-test was composed of 14 questions: four on the remembering level, three on the understanding level, five related to the application level in the same domain as the game, and two transfer application questions to other contexts associated with ML2 of the CMMI. In addition, subjective data was collected on potential disturbing factors, as well as a subjective evaluation of the measurement competency level.

Table 4 Overview of the personal background of the participants

Personal characteristics                              SPI/UFSC   SE/UNIVALI   SQP/UNIVALI
Number of participants                                8          5            2
Average age [years]                                   27         37           38
Academic formation [in number of participants]
  Bachelor in Computer Science                        4          2            1
  Bachelor in Computer Engineering                    2          –            –
  Bachelor in Information Systems                     2          1            –
  Bachelor in other area                              –          2            1
  Specialization in Computer Science                  3          4            2
Professional certifications [in number of participants]
  Implementer of the Brazilian Model for Software
  Process Improvement MPS.BR                          1          1            –
  Brazilian Certification for Software Testing        –          1            –
Professional experience [average in years] as
  Software analyst or developer                       2.4        7            4
  Project manager                                     0.6        0.6          0
  SEPG member                                         0.3        0.6          0
  Software quality group member                       0.3        0.4          0
  Software process consultant                         0.3        0.4          0
  Software engineering professor                      0.1        3.2          2.5
Current professional position (multiple options possible) [in number of participants]
  Software analyst                                    2          0            1
  Software developer                                  1          0            0
  Tester                                              0          1            0
  Project manager                                     0          1            0
  SEPG                                                1          0            0
  Consultant                                          1          1            0
  Professor                                           0          1            2
  Not working in the software domain                  3          1            0
Work load [in number of participants]
  Full-time student                                   2          0            0
  Working approx. 20 h/week                           2          1            1
  Working approx. 40 h/week                           4          4            1
Applied measurement in practice [in number of participants]
  No                                                  7          4            2
  Yes                                                 1          1            0
Participated already in training courses (with more than 30 h) on [in number of participants]
  Software measurement                                0          0            0
  Software quality                                    0          3            1
  CMMI                                                0          1            0
  MPS.BR                                              2          1            0
  Project management                                  1          2            1


After the pre-test, the assignment of participants to either the experimental or the control group was done randomly and communicated to the participants. The results of the pre-tests were published only at the end of the experiment, together with the results of the post-test.

At the next encounter, only the experimental group received a brief introduction to the game and then played one game session, covering all steps of the game execution (steps 1–7). At the end, the automatically generated and encrypted log file on the performance of each participant was collected. Further subjective data on the game was collected via a questionnaire, including questions on the evaluation of the game, its strengths and weaknesses, the subjective evaluation of the learning effect, and the motivation for playing. The control group underwent no treatment.

At the last encounter, both groups completed the post-test. Again, subjective data was collected on potential disturbing factors, as well as a subjective evaluation of the competency level and the perceived learning impact of the game.

The in-class exercises, the game tasks and the questions of the pre- and post-test were designed to be similar in style, content and difficulty. An example of such a task is:

Imagine a company that develops e-government systems and that last year initiated a software process improvement initiative in alignment with ML 2 of the CMMI. One of the main characteristics of the company is that it frequently sub-contracts other software companies to supply parts of its systems. Among the suppliers are the companies e-law, big-law and best-law. In the past, there have been many problems with the products supplied by e-law, which exhibited several defects. These defects are mainly detected during integration testing. Therefore, one of the primary focuses is on the process area Supplier Agreement Management. Now, senior management wants to observe trends in the quality, mainly the reliability, of the products supplied to the organizational unit e-blob and, therefore, requests that you establish a measurement program. What would be an adequate measurement goal in this context?

Object: _______
Purpose: □ characterize  □ monitor  □ evaluate  □ control  □ predict  □ identify causal relationships
Quality focus: _______
Viewpoint: _______
Context: _______
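A GQM measurement goal of this form is essentially a five-slot template. As a minimal sketch (field names and example values are ours, chosen to match the task above, not taken from the game):

```python
from dataclasses import dataclass

@dataclass
class MeasurementGoal:
    obj: str            # object of measurement, e.g. a process or product
    purpose: str        # characterize / monitor / evaluate / control / predict / ...
    quality_focus: str  # the attribute of interest, e.g. reliability
    viewpoint: str      # whose perspective the measurement takes
    context: str        # the environment in which measurement takes place

goal = MeasurementGoal(
    obj="supplied products",
    purpose="monitor",
    quality_focus="reliability",
    viewpoint="senior management",
    context="organizational unit e-blob",
)
```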

Table 5 Schedule of the experiments

Day 1: Presentation of the teaching plan (5 min); consent form (5 min); background questionnaire (10 min); lecture & in-class exercises (3:30 h); post-lecture questionnaire (15 min)
Day 2: Pre-test (1 h); test questionnaire (5 min)
Day 3: Game, experimental group only (game log file) (2 h); post-game questionnaire (30 min)
Day 4: Post-test (1 h); test questionnaire (5 min)


5 Data Collection and Preparation

The raw data for variables Y.1 and DF.3 was collected through questionnaires in parallel to the pre- and post-test. Data for the variables Y.2–Y.4 was collected during the pre- and post-tests. Data on Y.5, Y.6 and Y.7 was collected through a post-game questionnaire. Data on DF.1 and DF.2 was collected at the beginning of the experiment through a background questionnaire. The raw data was treated as described in Table 6 in order to prepare the data analysis.

6 Analysis

6.1 Data Analysis Procedure

In order to obtain greater accuracy and statistical power by increasing the sample size, we studied the combined results of the series of experiments. Therefore, we performed a joint analysis, cumulating the data from the individual experiments into one data set. As the three experiments were run in parallel by the same research group in similar contexts under the same conditions (research hypothesis, experimental design, treatments, material and measurement instruments, etc.), we consider them as one “big” experiment. Such a procedure is applicable in this specific case, as the data of the individual studies is similar and homogeneous enough to be combined. We also regard the personal background of the participants in all three studies as comparable (Table 4), with only a slightly higher experience level in the SE/UNIVALI group. In any case, we deal with differences in the personal background of the participants by analyzing the difference between pre- and post-test, instead of considering only absolute post-test results.

In this series of experiments, our primary concern regarding accuracy and statistical power is the very small sample size, even when combining the data from the three studies (n=15). For such small data sets, it is basically impossible to tell whether the data comes from a normally distributed variable (Levin and Fox 2006), as with small sample sizes (n<20) tests of normality may be misleading. In this situation, nonparametric tests are an appropriate approach. Inspecting the data distribution using histograms and box plots, we could not assume a normal distribution of the variables. Therefore, we use a non-parametric test (one-tailed Mann–Whitney U), as it is considered the most powerful non-parametric alternative to the t-test for independent samples. Due to the small sample size, we also do not use a z-value to approximate the significance level, but compare the minimum U to critical U values from standard reference tables (Levin and Fox 2006).

Usually, the commonly accepted practice is to set α=0.01. However, controlling both the Type I error (α) and the Type II error (β) requires either a large effect size or large sample sizes. If neither effect size nor sample size can be increased to maintain a low risk of error, the only remaining strategy is to permit a higher risk of error (Lipsey 1990). Thus, we decided to set α=0.05.
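To make the test procedure concrete, a minimal self-contained sketch of the U statistic follows. The scores are invented for illustration; only the critical value of 13 (one-tailed, α=0.05, group sizes 8 and 7, from standard tables) matches the value used in Section 6.2:

```python
def mann_whitney_u_min(a: list[float], b: list[float]) -> float:
    """Smaller of the two Mann-Whitney U statistics; ties contribute 0.5."""
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = len(a) * len(b) - u_a  # U_A + U_B = n_A * n_B
    return min(u_a, u_b)

# Hypothetical post-test scores (illustrative only, not the study data):
group_a = [0.6, 0.7, 0.5, 0.8, 0.6, 0.4, 0.7, 0.5]  # experimental, n = 8
group_b = [0.7, 0.8, 0.6, 0.9, 0.7, 0.5, 0.8]       # control, n = 7

U_CRITICAL = 13  # one-tailed, alpha = 0.05, n1 = 8, n2 = 7
u_min = mann_whitney_u_min(group_a, group_b)
print(u_min, "significant" if u_min < U_CRITICAL else "not significant")
```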

In addition, we use descriptive statistics to analyze research questions 1 and 2, in order to provide summaries and to identify central tendencies and dispersion of the variables. With regard to variables Y.2, Y.3 and Y.4, we analyze the mean, median and standard deviation. For variables Y.1, Y.5, Y.6 and Y.7, we calculate the median and the interquartile range, as the values of these variables are measured on ordinal scales.
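For the ordinal variables, median and interquartile range can be computed, for example, as follows (the ratings are invented for this sketch):

```python
import statistics

def median_iqr(values: list[float]) -> tuple[float, float]:
    """Median and interquartile range (Q3 - Q1) of a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return statistics.median(values), q3 - q1

# Hypothetical four-point ordinal ratings (not study data):
print(median_iqr([3, 4, 3, 2, 4, 3, 3]))
```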


Table 6 Overview of the variables

Independent variable
  X.1 Game: A (experimental group), B (control group)

Dependent variables
  Y.1 Measurement competency (pre/post): 0—I know nothing; 0.2—I have a vague notion; 0.4—I know the basics; 0.6—I can apply measurement in practice with assistance; 0.8—I can apply measurement in practice without assistance; 1—I am a specialist
  Y.2 Measurement knowledge on the remembering level (pre/post): average score from 4 questions of the pre/post-test ∈ [0,1]
  Y.3 Measurement knowledge on the understanding level (pre/post): average score from 3 questions of the pre/post-test ∈ [0,1]
  Y.4 Measurement knowledge on the application level (pre/post): average score from 7 questions of the pre/post-test ∈ [0,1]
  Y.5 Subjective learning effect:
    Y.5.1 Learning effect on measurement concepts and process: 4—very much, 3—much, 2—little, 1—nothing
    Y.5.2 Learning effect on measurement application: 4—very much, 3—much, 2—little, 1—nothing
  Y.6 Appropriateness (each dimension: 4—excellent, 3—good, 2—fair, 1—unsatisfactory):
    Y.6.1 Content relevancy; Y.6.2 Correctness; Y.6.3 Sufficiency; Y.6.4 Difficulty; Y.6.5 Sequence; Y.6.6 Teaching method; Y.6.7 Duration
  Y.7 Engagement:
    Y.7.1 Satisfaction: 4—liked a lot, 3—liked, 2—liked little, 1—did not like the game
    Y.7.2 Fun: 4—lots of fun, 3—fun, 2—little fun, 1—none

Disturbing factors
  DF.1 Personal background: for each participant, a score calculated by adding a count of 1 for each of the following items present:
    B.Sc. in computer science or related area
    Specialization in computer science or related area


6.2 Hypothesis Testing

Regarding research question 1 on the learning effect of the educational game, we analyze the research hypotheses H01a and H01b. Table 7 shows the results of testing hypothesis H01a (there is no significant difference in relative learning effectiveness between group A (experimental group) and group B (control group)) using a one-tailed Mann–Whitney U test with α=0.05.

For significance level α=0.05, the critical value is $U_{critical} = 13$. For none of the variables is $U_{min} < U_{critical}$; therefore, the differences are not significant at the 5% level and we cannot reject the null hypothesis H01a.

Table 6 (continued)

  DF.1 Personal background (continued):
    Professional certification related to software engineering
    Training (with more than 30 h) in software measurement
    Training (with more than 30 h) in software quality
    Training (with more than 30 h) in CMMI
    Training (with more than 30 h) in MPS.BR
    Training (with more than 30 h) in project management
    Experience (of more than 1 year) as SW developer/analyst
    Experience (of more than 1 year) as SW project manager
    Experience (of more than 1 year) as SEPG member
    Experience (of more than 1 year) as SW quality responsible
    Experience (of more than 1 year) as SW improvement consultant
    Experience (of more than 1 year) as SW engineering professor
    Experience of applying a measurement program in practice
  DF.2 Motivation:
    DF.2.1 Importance of measurement: 1—very important, 0.66—important, 0.33—less important, 0—not important
    DF.2.2 Interest in learning more about measurement: 1—very interested, 0.66—interested, 0.33—less interested, 0—not interested
  DF.3 Additional study time (pre/post): 0—none, 0.2—about 30 min, 0.4—approx. 1 h, 0.6—approx. 2 h, 0.8—approx. 5 h, 1—more than 5 h

The complete data set is documented in the Appendix.


Table 8 shows the results of testing H01b (there is no significant difference in absolute learning effectiveness between group A and group B) using a one-tailed Mann–Whitney U test with α=0.05.

For significance level α=0.05, the critical value is $U_{critical} = 13$. For Y.1post, Y.2post and Y.4post, $U_{min} > U_{critical}$; therefore, the differences are not significant at the 5% level and we cannot reject the null hypothesis H01b for these variables.

For Y.3post, $U_{min} < U_{critical}$; therefore, the difference can be considered significant at the 5% level and we can reject the null hypothesis H01b for this variable. However, the observed difference is not in the predicted direction, as the scores of group B (control group) are higher than the scores of the experimental group A.

6.3 Descriptive Statistics

Applying descriptive statistics, we also cannot identify any difference regarding the central tendency of Y.1 between groups A and B. Yet, an interesting observation regarding this variable, which represents a subjective self-evaluation of the students' measurement competency, is that two students of group A and one of group B rated their competency lower after the post-test than after the pre-test (Table 9).

Regarding Y.2 Measurement knowledge on the remembering level, only very few students of either group presented a small improvement. The mean value of group A even indicates a score reduction. Regarding Y.3 Measurement knowledge on the understanding level and Y.4 Measurement knowledge on the application level, no significant difference between the two groups can be observed either (Table 10).

Analyzing the subjective assessment of the learning effect of the game (Y.5), answered by the participants of the experimental group A only, we observe that most students indicated that they learned much about software measurement concepts and process (Y.5.1). Opinions on the learning effect regarding the application of measurement are more widespread, but the majority also indicated that they learned much on this topic. This, however, partly contradicts our initial hypothesis that the learning effect of the game targets the application of measurement more than concepts and process (Fig. 7).

Regarding research question 2, we analyzed whether the educational game is considered appropriate in terms of content relevancy, correctness, sufficiency, degree of difficulty, sequence, teaching method and duration in the context for which it is intended. The objective of this research question was to capture a subjective evaluation of these aspects from the learner's point of view.

Table 7 Results for H01a between group A and B

Variable    Rank A    Rank B    Umin = min(U, U′)
Y.1diff     52        67.5      16
Y.2diff     52.5      67.5      16.5
Y.3diff     74        46        18
Y.4diff     68        52        24

Table 8 Results for H01b between group A and B

Variable    Rank A    Rank B    Umin = min(U, U′)
Y.1post     61.5      58.5      25.5
Y.2post     51        69        15
Y.3post     48        72        12
Y.4post     49.5      70.5      13.5



Figure 8 presents the results regarding the dimensions that address the participants' perception of the game's appropriateness.

As indicated in the chart, the highest rated dimensions were Y.6.1 content relevancy and Y.6.5 sequence. The lowest rated dimensions were Y.6.4 difficulty and Y.6.6 teaching method, followed by Y.6.7 duration. These three dimensions also showed the greatest level of disagreement among the participants.

Regarding the question of whether the participants regard the game as engaging, we analyzed Y.7.1 Satisfaction, as shown in Fig. 9, and Y.7.2 Fun, presented in Fig. 10.

The majority of the students either liked the game or liked it a lot. However, at the same time, most of them indicated that they had only little fun playing the game.

6.4 Qualitative Analysis

In open-ended questions, participants were asked to identify at least two weaknesses and strengths of the game. Concerning strengths, the most cited aspects were the immediate feedback given after each task during the game (four citations) and the degree of information provided with the feedback (two citations). Two participants also highlighted the logical sequence of steps in the game. Participants further cited as a strength the simulation of meetings as part of the measurement process in the game (two citations).

As a main weakness, three participants cited the number of alternatives for each question in the game, which they considered too many, especially as some of the alternative descriptions are extensive (e.g., complete abstraction sheets). This wore the participants out, and they started to lose interest. One participant evaluated the duration of the game as too long. Other weaknesses are mainly related to the ergonomics of the prototype, which has limited functionality, including:

– inability to navigate within the tasks of the game (four citations), e.g., the prototype does not permit returning to a previous task within one game session;
– size and type/variety of fonts used for textual information (two citations);
– use only of (long) texts, without animations, for the presentation of information (one citation).

Table 9 Descriptive analysis of Y.1

                       Y1.diff
Group A
Median                 0.0
Interquartile range    0.2
Group B
Median                 0.0
Interquartile range    0.0

Table 10 Descriptive analysis of Y.2, Y.3 and Y.4

           Y2.diff    Y3.diff    Y4.diff
Group A
Mean       −0.125     0.17       0.05
Median     0.0        0.19       0.03
SD         0.38       0.15       0.09
Group B
Mean       0.11       0.07       0.02
Median     0.0        0.0        0.09
SD         0.20       0.17       0.15

In open-ended questions, participants were also asked to identify improvement suggestions with regard to the content as well as the game's format. By far the most common suggestion was related to the enhancement of multimedia usage in the game, as we used text only in the prototype. Five participants suggested varying the representation of information, using images, audio, video and animations. Two participants also suggested making the game more interactive. An important issue was the improvement of the usability of the game, especially with regard to more navigation flexibility (three citations) as well as the improvement of the size/type of the text fonts used (three citations). One participant requested the simplification of the text information itself. An interesting piece of feedback was the suggestion to enable the revision of (wrong) answers in order to give the learner a second chance to answer a question correctly. One participant suggested reducing the size/duration of the game.

Fig. 7 Results on Y.5 Subjective learning effect (n=8) (experimental group only) [bar chart: number of participants per answer ("very much", "much", "few", "nothing") for concepts and process vs. application]

Fig. 8 Results on Y.6 Appropriateness (n=8) [bar chart: number of participants per rating ("excellent", "good", "fair", "unsatisfactory") for content relevancy, correctness, sufficiency, difficulty, sequence, teaching method and duration]



6.5 Threats to Validity

Four types of threats can compromise the validity of an experiment (Wohlin et al. 2000): internal, external, construct and conclusion validity.

Internal Validity Internal validity is the degree to which changes in the dependent variables can be safely attributed to changes in the independent variables. Threats to internal validity are influences that may indicate a causal relationship although there is none. A possible threat to internal validity is that the treatment groups behave differently because of a confounding factor, such as a difference in skills, experience or motivation, an instrumentation effect or a maturation effect.

We used randomization to assign participants to groups to avoid a selection effect. Table 18 shows the analysis of the participants' background in terms of formation, experience and training, and Table 19 the analysis of their motivation. The results show no significant differences between the groups.

Fig. 9 Results on Y.7.1 Satisfaction (n=8) [bar chart: number of participants per answer ("liked a lot", "liked", "liked a bit", "did not like")]

Fig. 10 Results on Y.7.2 Fun (n=8) [bar chart: number of participants per answer ("lot of fun", "fun", "little fun", "no fun")]



Thus, we assume that both groups are statistically equal. In addition, we tried to neutralize any potential bias induced by differences in pre-test scores and experience by analyzing the differences between pre-test and post-test scores.

Another aspect is that unplanned events occurring concurrently with the treatment could have caused the observed effect. This concerns any kind of learning that might have occurred through activities outside of the experiment, e.g., studying at home for the tests. With respect to this concern, participants were instructed not to study software measurement outside of the experiment for the time being. In any case, we tried to capture the time studied for each test through variable DF.3. Despite these precautions, there is no guarantee that subjects did not study more than indicated or acquired further software measurement knowledge in another way.

Due to the experimental design, in which no treatment was applied to the control group, a positive learning effect in the experimental group might simply be attributed to the extra time spent on the material by playing the game, in comparison to the control group, which spent no extra time on a measurement-related treatment.

Non-random drop-out of subjects was avoided by the experimental design, i.e., by assigning participants to groups only after the pre-test and before the treatment, not at the beginning of the experiment. In fact, all participants completed the experiment.

A maturation effect may have been caused, as participants knew before the pre-test that at the end of the experiment they would complete a post-test with similar questions. However, all pre-tests were collected and only returned with feedback after the administration of the post-test, reducing such a possible effect. In addition, taking the pre-test can also have caused a learning effect. We tried to minimize this by presenting similar questions as exercises already during the expositive lecture. However, an inverse effect could also have taken place: specifically for the control group, a longer interval between the lecture and the post-test than between the lecture and the pre-test may have caused a reduction of scores, especially on the remembering level, if no additional studying took place. The repetition of the test may also have caused a loss of enthusiasm or motivation among the participants.

Care was taken in the preparation of the pre- and post-tests so that the artifacts were designed correctly and with comparable content and difficulty. The tests were marked by one instructor, revising them question by question without any information on the participants, in order to reduce any bias in the correction of the tests.

External Validity Threats to external validity reduce the ability to generalize the results to the population under study and to other settings outside the study. Possible threats identified are:

The experiments were run with a very small set of participants and by the same research group, which may reduce the generalizability of the results.

The experiments took place in an academic environment and, therefore, a generalization of the results from students to professionals may be limited. Yet, this may not be a critical threat here, as about 75% of the participants were software professionals working at least part-time and studying in parallel.

Another threat may be that the tasks in the game as well as in the tests simplify the definition and execution of a measurement program, as they deal more with the selection of adequate work products than with their construction from scratch, as required in a real situation. And, due to the short time frame available, the tasks in the game as well as in the tests were limited and may not be representative of complete measurement programs. However, we cautiously prepared the material used in the game and the tests based on practical experiences, covering the complete measurement process as far as possible.

Furthermore, the very nature of training courses implies that participants are likely to be immature in the use of the particular topic being taught. In addition, due to curricular constraints, the amount of training and, hence, the acquisition of basic knowledge on measurement may have been limited. This may have created a situation in which the learning of application knowledge could not take place and, therefore, may have reduced the observed effectiveness of the game in comparison to a situation with more qualified measurement professionals.

Construct Validity Construct validity is the degree to which the independent and dependent variables accurately measure the concepts they purport to measure. The following issues associated with construct validity have been identified:

Due to practical constraints of running the experiment as part of an academic lecture, learning effects could only be assessed on levels 1 and 2 of Kirkpatrick's four levels of evaluation (Kirkpatrick and Kirkpatrick 2006). Potential learning effects were measured through the post-test results as well as through the participants' subjective perception of their measurement competency. However, this might not adequately capture the real learning effect; especially as we intend to measure knowledge on the application level, it may not be sufficient to determine whether the participants learned to use the newly acquired knowledge in their everyday environment.

Another possible threat is that aspects such as game appropriateness and engagement are difficult to measure and were captured through subjective measurements. To counteract this threat to validity, the instruments for measuring these variables were, at least partially, derived from measurement instruments that have been applied in other, similar studies (Pfahl et al. 2003; Oh Navarro and van der Hoek 2007).

A significant threat to validity is also the subjective evaluation of aspects such as the correctness, relevancy and completeness of the game with respect to measurement theory. Students who are in the process of learning about measurement might not be able to provide valuable feedback on these issues and/or may be biased. Therefore, we did not analyze these variables solely based on the students' feedback: for a comprehensive analysis of the correctness, relevancy and completeness of the game, reviews were performed by software engineering experts during the development of the game.

Conclusion Validity Conclusion validity is concerned with ensuring that there is a statistical relationship between the treatment and the outcome.

Performing a joint analysis of the data collected in the series of experiments may have caused an effect called Simpson's paradox. This refers to a phenomenon in which an association between a pair of variables observed in individual datasets is consistently inverted when the datasets are combined. Yet, as the series of experiments was essentially run as one "big" experiment, and as potentially disturbing factors such as the participants' background, motivation and additional study time were comparable across experiments, caution was taken to generate meaningful results.
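For readers unfamiliar with the phenomenon, here is a minimal numeric illustration with hypothetical success counts (the classic textbook figures, not data from this study): group A shows the higher success rate in each individual dataset, yet the lower rate once the datasets are pooled.

    # Hypothetical (successes, trials) counts: the classic textbook example.
    datasets = {
        "dataset 1": {"A": (81, 87), "B": (234, 270)},
        "dataset 2": {"A": (192, 263), "B": (55, 80)},
    }

    def rate(successes, trials):
        return successes / trials

    for name, groups in datasets.items():
        print(name, rate(*groups["A"]) > rate(*groups["B"]))  # True: A wins in both

    # Pooled counts invert the association: A 273/350 ~ 0.78 vs B 289/350 ~ 0.83.
    pooled_a = [sum(x) for x in zip(*(g["A"] for g in datasets.values()))]
    pooled_b = [sum(x) for x in zip(*(g["B"] for g in datasets.values()))]
    print("pooled", rate(*pooled_a) > rate(*pooled_b))  # False: B wins overall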

Yet, even when combining the data, the very small sample size represents a threat that basically impedes any valid demonstration of statistical relationships. However, instead of simply abandoning the research, we adopted more robust statistical tests, accepting a low statistical power. When the expected hypotheses could not be confirmed, we did not modify or influence the results in any way, in order to prevent any "fishing" effect.



Another aspect may be the reliability of the measures used. For example, the test scores may have been too limited to represent a learning effect, due to the relatively small set of questions used in the tests.

As this is an exploratory study to gain first insights into the learning effectiveness of the educational game X-MED, we accept that the significance of the results is weakened by these threats to validity.

7 Analysis Summary and Discussion

Based on these results, we cannot conclude that the game contributes to a positive learning effect. Although most participants subjectively believe that playing the game helped them to learn, the results of the statistical tests do not support these subjective evaluations. The obvious explanation for these results is that the game in its current prototype form is not adequate to support learning effectively. Considering the participants' subjective evaluations of the game's appropriateness, especially the teaching method may have to be revised, as well as the usability of the game.

Another explanation may be that the degree of difficulty of the game requires a more comprehensive understanding of software measurement, project management and the CMMI framework than the learners were able to acquire during the 4-h lecture. Several participants (10 citations) commented that they experienced the lecture as very dense and suggested increasing the duration of the lecture before playing the game.

In the experiment, the participants played the game in the classroom without any breaks. The average time spent on playing the game was about 100 min. As this is far beyond the typical concentration span of about 20 min, participants may have lost concentration and motivation. This may have been aggravated by the current format of the prototype, which uses text only, as well as by the number and extent/size of the possible alternatives.

In the experiment, all participants played the game only once, as in its current version the game consists of only one scenario. This might not be sufficient to produce a significant learning effect; it may be that participants will only benefit from repeated exposure to different scenarios.

Other causes may be related to characteristics of the experiment. One obvious reason is that, due to the very small sample size, it may not be possible to identify statistically significant results. In addition, the relatively small number of questions in the pre-/post-tests may have impeded the observation of a larger effect size.

Another factor that may have influenced the test scores is the additional time participants of both groups spent studying for the tests (Table 20). However, based on the information provided by the participants, those of group A studied slightly more for the tests than those of group B, which one would expect to cause an even larger learning effect in group A.

However, an interesting aspect can be observed when comparing the test scores (Y.2–Y.4) with the subjective evaluation of measurement competency. Two participants of the experimental group and one participant of the control group rated their measurement competency lower at the time of the post-test than at the pre-test, although their test scores (Y.2–Y.4) improved. This may indicate that the tests as well as the game led them to reconsider their initial self-evaluation, which may have overestimated their real competency.

In general, the participants evaluate the game prototype as appropriate and engaging, although we kept this prototype quite simple, as our primary focus was to obtain first feedback rapidly. The most important improvement suggestions are related to the use of multimedia elements to increase the appeal of the game. And, although most participants did not experience playing the game as fun, they commented that they preferred the game to a paper exercise. Thus, increasing the attractiveness and fun factor of the game may also help to increase its learning impact.

Results such as those obtained in our study can also be observed in other related research. For example, in the study on the SESAM system (Drappa and Ludewig 2000), a learning effect could not be confirmed either. This can be explained by the fact that evaluation in the education domain is difficult: due to multiple interacting factors, it is often impossible to isolate the effects of an educational technique, to track the real learning effect longitudinally and/or to obtain statistically significant results (Oh Navarro and van der Hoek 2007; McNabb et al. 1999; Almstrum et al. 1996). In addition, the immaturity of the software engineering domain creates difficulties in conducting comparative evaluations (Oh Navarro and van der Hoek 2007).

8 Conclusion

This study was explorative in nature and, although we could not statistically demonstrate a learning effect, the subjective evaluations indicate the potential of such a game to support education. In addition, the study provided first insights into the game and its main strengths and weaknesses, which will systematically guide its further evolution. And, although our results have answered some of our initial questions, they have also raised new questions and exposed issues that need to be addressed through further studies. Thus, we are planning to repeat the experiment with modifications to the initial training, in order to enable the acquisition of a more comprehensive basic understanding, as well as with modifications to the experiment material and the game itself.

Based on the feedback obtained, we are currently evolving the conception of the game by enhancing its complexity and variability in the direction of a simulation game. This mainly includes its modification to a nonlinear sequence of actions through the construction of a simulation model and, consequently, an increase in the available scenarios and options. We expect that this will allow a more dynamic learning experience and enable playing several game sessions without repeating the same scenario. In addition to the feedback on usability aspects obtained through this study, we are also evaluating and improving the usability and design of the game based on usability frameworks (Bastien and Scapin 1993; Malone 1982; ISO 2008). We are also preparing multimedia content, including, e.g., 3D animation videos of the interview and meetings. Incorporating these modifications of the game's conception, we are re-implementing a web-based version of the game software. Once the enhanced version of the game becomes available, we will repeat the experiment. In this context, the results of the study presented in this paper will also be valuable as a baseline for comparison.

Acknowledgements The authors would like to thank Juliana I. Lino for her work on the conception of the initial version of the game, and Leonardo Steil and Djoni Silva for the implementation of the prototype. Special thanks also to all the students of the master courses who participated in the experiments. We would also like to thank Emily Oh Navarro for sharing material, and we are grateful to Sílvia M. Nassar for advice on the statistical analysis. We would also like to thank the anonymous reviewers of a previous version of this paper for their valuable comments and suggestions.

This work was supported by CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), an entity of the Brazilian government focused on scientific and technological development. Further support was provided by UNIVALI (Universidade do Vale do Itajaí), Brazil.



Appendix

Table 11 Data on Y.1 Measurement competency

Participant   X.1 Group   Y1.pre   Y1.post   Y1.diff
1             A           0.4      0.4        0
2             A           0.6      0.4       −0.2
3             B           0.6      0.6        0
4             B           0.4      0.6        0.2
5             B           0.6      0.4       −0.2
6             A           0.6      0.4       −0.2
7             A           0.6      0.6        0
8             B           0.8      0.8        0
9             A           0.6      0.8        0.2
10            B           0.6      0.6        0
11            B           0.6      0.6        0
12            A           0.6      0.6        0
13            A           0.6      0.6        0
14            A           0.4      0.6        0.2
15            B           0.4      0.4        0

Table 12 Data on Y.2 Measurement knowledge on the remembering level

Participant   X.1 Group   Y2.pre   Y2.post   Y2.diff
1             A           0.75     0.75       0
2             A           1        0.25      −0.75
3             B           1        1          0
4             B           1        1          0
5             B           0.75     1          0.25
6             A           0.75     0.75       0
7             A           0.5      0.5        0
8             B           0.75     0.75       0
9             A           0.5      0.5        0
10            B           0.75     0.75       0
11            B           0.5      0.5        0
12            A           0.5      0.25      −0.25
13            A           0.75     0.25      −0.5
14            A           0.5      1          0.5
15            B           0.5      1          0.5



Table 13 Data on Y.3 Measurement knowledge on the understanding level

Participant   X.1 Group   Y3.pre   Y3.post   Y3.diff
1             A           0.35     0.68       0.33
2             A           0.27     0.58       0.31
3             B           0.92     0.92       0
4             B           0.58     0.9        0.32
5             B           0.77     0.62      −0.15
6             A           0.45     0.52       0.07
7             A           0.45     0.67       0.22
8             B           0.5      0.5        0
9             A           0.42     0.58       0.16
10            B           1        0.92      −0.08
11            B           0.6      0.78       0.18
12            A           0.25     0.53       0.28
13            A           0.67     0.75       0.1
14            A           0.78     0.68      −0.1
15            B           0.62     0.83       0.21

Table 14 Data on Y.4 Measurement knowledge on the application level

Participant   X.1 Group   Y4.pre   Y4.post   Y4.diff
1             A           0.43     0.61       0.18
2             A           0.16     0.17       0.01
3             B           0.9      0.79      −0.11
4             B           0.65     0.8        0.15
5             B           0.39     0.56       0.17
6             A           0.59     0.56      −0.03
7             A           0.34     0.4        0.06
8             B           0.67     0.55      −0.12
9             A           0.79     0.76      −0.03
10            B           0.52     0.66       0.14
11            B           0.72     0.81       0.09
12            A           0.33     0.37       0.04
13            A           0.37     0.56       0.19
14            A           0.56     0.55      −0.01
15            B           0.63     0.46      −0.17



Table 15 Data on Y.5 Subjective learning effect

Participant           Y.5.1 Learning effect on measurement concepts and process   Y.5.2 Learning effect on measurement application
1                     2     2
2                     3     3
6                     3     3
7                     2     2
9                     3     4
12                    3     3
13                    3     3
14                    3     3
Median                3.0   3.0
Interquartile range   0.5   0.5

Table 16 Data on Y.6 Appropriateness

Participant           Y.6.1 Relevancy   Y.6.2 Correctness   Y.6.3 Sufficiency   Y.6.4 Difficulty   Y.6.5 Sequence   Y.6.6 Teaching method   Y.6.7 Duration
1                     4                 2                   3                   2                  4                2                       4
2                     3                 3                   3                   3                  3                3                       2
6                     3                 3                   3                   4                  4                3                       3
7                     4                 3                   2                   4                  4                2                       4
9                     4                 4                   3                   3                  3                4                       3
12                    3                 3                   3                   2                  4                3                       2
13                    4                 4                   4                   3                  3                3                       3
14                    4                 3                   4                   3                  4                4                       4
Median                4.0               3.0                 3.0                 3.0                4.0              3.0                     3.0
Interquartile range   1.0               0.5                 0.5                 1.0                1.0              1.0                     1.5

Table 17 Data on Y.7 Engagement

Participant           Y.7.1 Satisfaction   Y.7.2 Fun
1                     2                    2
2                     3                    2
6                     4                    3
7                     3                    2
9                     4                    3
12                    3                    2
13                    4                    2
14                    3                    2
Median                3.0                  2.0
Interquartile range   1.0                  0.5



Table 18 Data on DF.1 Personal background

Participant           Formation   Training   Experience   Total score
Group A
1                     2           0          1            3
2                     1           0          0            1
6                     1           0          1            2
7                     1           0          0            1
9                     3           4          6            13
12                    0           1          1            2
13                    1           1          5            7
14                    1           2          2            5
Median                                                    2.5
Interquartile range                                       4.5
Group B
3                     2           1          5            8
4                     1           0          0            1
5                     1           0          1            2
8                     3           2          2            7
10                    1           1          1            3
11                    1           0          2            3
15                    2           0          1            3
Median                                                    3.0
Interquartile range                                       5.0

Table 19 Data on DF.2 Motivation

Participant           DF.2.1 Importance   DF.2.2 Interest
Group A
1                     3                   3
2                     3                   3
6                     4                   4
7                     3                   4
9                     4                   4
12                    4                   4
13                    4                   4
14                    3                   3
Median                3.5                 4.0
Interquartile range   1.0                 1.0
Group B
3                     4                   4
4                     3                   3
5                     3                   3
8                     4                   4
10                    4                   4
11                    4                   4
15                    3                   4
Median                4.0                 4.0
Interquartile range   1.0                 1.0

Table 20 Data on DF.3 Additional study time

Participant           DF.3pre   DF.3post
Group A
1                     0.2       0.4
2                     0.8       0.4
6                     0.6       0
7                     0.2       0
9                     0.6       0.4
12                    0.4       0.2
13                    0.4       0
14                    0.2       0
Median                0.4       0.1
Interquartile range   0.4       0.4
Group B
3                     0.18      0
4                     0.1       0.6
5                     0.2       0
8                     0.18      0
10                    0.1       0.4
11                    0.2       0
15                    0.18      0.2
Median                0.18      0.0
Interquartile range   0.1       0.4



References

Abt CC (2002) Serious games. University Press of America, Lanham, MD

Akili GK (2007) Games and simulations: a new approach in education. In: Gibson D, Aldrich C, Prensky M (eds) Games and simulations in online learning: research and development frameworks. Information Science Publishing, Hershey, PA, pp 1–20

Almstrum VL, Dale N, Berglund A, Granger M, Currie Little J, Miller DM et al (1996) Evaluation: turning technology from toy to tool. Proceedings of the 1st Conference on Integrating Technology into Computer Science Education (ITiCSE '96), Barcelona, Spain, pp 201–217

Anderson LW, Krathwohl DR (eds) (2001) A taxonomy for learning, teaching, and assessing: a revision of Bloom's taxonomy of educational objectives. Longman, New York

Baker A, Oh Navarro E, van der Hoek A (2003) Problems and programmers: an educational software engineering card game. Proceedings of the 2003 International Conference on Software Engineering, Portland, Oregon, pp 614–619

Basili VR, Caldiera G, Rombach HD (1994) Goal/question/metric approach. In: Marciniak J (ed) Encyclopedia of software engineering, vol 1. John Wiley & Sons, New York, pp 528–532

Bastien JMC, Scapin D (1993) Ergonomic criteria for the evaluation of human-computer interfaces. Technical report no. 156. Institut National de Recherche en Informatique et en Automatique, France

Bruns B, Gajewski P (1999) Multimediales Lernen im Netz: Leitfaden für Entscheider und Planer. Springer, Berlin (in German)

Buglione L (2007) Project-o-poly. Giocare per apprendere. Il gioco come opportunità nelle learning organizations. Persone & Conoscenze, Jan/Feb 2007, no. 26/27, ESTE, pp 43–47 (in Italian)

Choi J, Hannafin M (1995) Situated cognition and learning environments: roles, structures and implications for design. Educ Technol Res Dev 43(2):53–69. doi:10.1007/BF02300472

CMMI Product Team (2006) CMMI for development, version 1.2. Technical report CMU/SEI-2006-TR-008. Software Engineering Institute/Carnegie Mellon University, Pittsburgh, Pennsylvania

Collofello JS (2000) University/industry collaboration in developing a simulation based software project management training course. Proceedings of the 13th Conference on Software Engineering Education and Training, Austin, Texas, pp 161–168

Dantas A, Barros M, Werner C (2004) A simulation-based game for project management experiential learning. Proceedings of the 16th International Conference on Software Engineering & Knowledge Engineering (SEKE'2004), Banff, Canada, pp 19–24

Dekkers CA, McQuaid PA (2002) The dangers of using software metrics to (mis)manage. IEEE IT Professional, IEEE Computer Society, March/April 2002

Drappa A, Ludewig J (2000) Simulation in software engineering training. Proceedings of the 22nd International Conference on Software Engineering, Limerick, Ireland, pp 199–208

Ellington H, Addinall E, Percival F (1982) A handbook of game design. Kogan Page, London

Greitzer FL, Kuchar OA, Huston K (2007) Cognitive science implications for enhancing training effectiveness in a serious gaming context. J Educ Resour Comput 7(3), Art. 2. ACM, New York

Hock GT, Hui GLS (2004) A study of the problems and challenges of applying software metrics in the software development industry. Proceedings of the M2USIC-MMU International Symposium on Information and Communication Technologies, Putrajaya, Malaysia

ISO 9241-151 (2008) Ergonomics of human-system interaction—part 151: guidance on world wide web user interfaces. International Organization for Standardization, Geneva

Jain A, Boehm B (2006) SimVBSE: developing a game for value-based software engineering. Proceedings of the 19th Conference on Software Engineering Education and Training, Turtle Bay, Hawaii, pp 103–111

Janzen DS, Turner CS, Saiedian H (2007) Empirical software engineering in industry short courses. Proceedings of the 20th Conference on Software Engineering Education & Training (CSEET), Dublin, Ireland, pp 89–96

Kafai YB (2001) The educational potential of electronic games: from games-to-teach to games-to-learn. Conference on Playing by the Rules: The Cultural Policy Challenges of Video Games, Chicago, Illinois

Kasunic M (2006) The state of software measurement practice: results of 2006 survey. Technical report CMU/SEI-2006-TR-009. Carnegie Mellon University/Software Engineering Institute, Pittsburgh, Pennsylvania

Kirkpatrick DL, Kirkpatrick JD (2006) Evaluating training programs: the four levels, 3rd edn. Berrett-Koehler Publishers, San Francisco

Levin J, Fox J (2006) Elementary statistics in social research. Allyn & Bacon, Boston

Lino JI (2007) Proposal of an educational game for software measurement and analysis. Project thesis, Undergraduate Course on Information Systems, Federal University of Santa Catarina, Brazil (in Brazilian Portuguese)

Lipsey M (1990) Design sensitivity. Sage, California

Löper S, Zehle M (2003) Evaluation of software metrics in the design phase and their implication on CASE tools. Master thesis, Blekinge Institute of Technology, Sweden

Malone TW (1982) Heuristics for designing enjoyable user interfaces: lessons from computer games. Proceedings of the Conference on Human Factors in Computing Systems, Gaithersburg, Maryland

McGarry J, Card D, Jones C, Layman B, Clark E, Dean J et al (2001) Practical software measurement: objective information for decision makers. Addison-Wesley Professional, Reading

McNabb M, Hawkes M, Rouk U (1999) Critical issues in evaluating the effectiveness of technology: conference summary. National Conference on Educational Technology, Washington, D.C.

Michael D, Chen S (2006) Serious games: games that educate, train, and inform. Thomson Course Technology, Boston

Molenda M, Pershing JA, Reigeluth CM (1996) Designing instructional systems. In: Craig RL (ed) The ASTD training and development handbook, 4th edn. McGraw-Hill, New York, pp 266–293

Oh Navarro E, van der Hoek A (2007) Comprehensive evaluation of an educational software engineering simulation environment. Proceedings of the 20th Conference on Software Engineering Education and Training, Dublin, Ireland, pp 195–202

Ott LM (2005) Developing healthy skepticism not disbelief: problems in teaching software metrics. Proceedings of the 1st Workshop on Methods for Learning Metrics at the 11th IEEE Software Metrics Symposium, Como, Italy

Percival F, Ellington H, Race P (1993) Handbook of educational technology, 3rd edn. Kogan Page, London

Pfahl D, Laitenberger O, Dorsch J, Ruhe G (2003) An externally replicated experiment for evaluating the learning effectiveness of using simulations in software project management education. Empir Software Eng 8:367–395

Prensky M (2001) Digital game-based learning. McGraw-Hill, New York

Sharp H, Hall P (2000) An interactive multimedia software house simulation for postgraduate software engineers. Proceedings of the 22nd International Conference on Software Engineering, Limerick, Ireland, pp 688–691

Takona JP (2002) Educational research: principles and practice. Writers Club Press, New York

Thomas R (1996) A practical experiment in teaching software engineering metrics. Proceedings of the International Conference on Software Engineering: Education & Practice, Otago, New Zealand, pp 226–232

Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2000) Experimentation in software engineering—an introduction. Kluwer Academic, Norwell

Christiane Gresse von Wangenheim is a Professor at the Universidade do Vale do Itajaí (UNIVALI) and a Consultant at Incremental Tecnologia. Her research interests are software process improvement, including project management and software measurement. Previously, she worked at the Fraunhofer Institute for Experimental Software Engineering. She received a Ph.D. degree in Production Engineering at the Federal University of Santa Catarina (Brazil) and a Ph.D. degree in Computer Science at the University of Kaiserslautern (Germany). She is also a PMP (Project Management Professional) and an Assessor of the Brazilian Process Improvement Model MPS.BR. She is a member of the IEEE Computer Society, the Brazilian Computer Society, the Project Management Institute, and the Working Group ISO/IEC JTC1/SC7/WG24—SE Life-Cycle Profiles for Very Small Enterprises.

Marcello Thiry has been a Professor at the Universidade do Vale do Itajaí (UNIVALI) since 1993 and is a Consultant at Incremental Tecnologia. His research interests are software process improvement, including requirements engineering and project management. He received a Ph.D. degree in Production Engineering at the Federal University of Santa Catarina (Brazil). He is also a PMP (Project Management Professional) and a Lead Assessor of the Brazilian Process Improvement Model (MPS.BR). He is a member of the IEEE Computer Society, the Brazilian Computer Society and the Project Management Institute.

Djone Kochanski is a Consultant at Datainfo Consultoria. His research interests are software process improvement. He received his B.Sc. in Computer Science from the Universidade Regional de Blumenau (FURB) and is an M.Sc. student in the Master Program on Applied Computer Science at the Universidade do Vale do Itajaí (UNIVALI), Brazil. He is a member of the Project Management Institute.
