
What vs. How: Comparing Students' Testing and Coding Skills

Colin Fidge 1, Jim Hogan 1, Ray Lister 2

1 School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Qld, Australia

2 Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia

{c.fidge, j.hogan}@qut.edu.au, [email protected]

Abstract

The well-known difficulties students exhibit when learning to program are often characterised as either difficulties in understanding the problem to be solved or difficulties in devising and coding a computational solution. It would therefore be helpful to understand which of these gives students the greatest trouble. Unit testing is a mainstay of large-scale software development and maintenance. A unit test suite serves not only for acceptance testing, but is also a form of requirements specification, as exemplified by agile programming methodologies in which the tests are developed before the corresponding program code. In order to better understand students' conceptual difficulties with programming, we conducted a series of experiments in which students were required to write both unit tests and program code for non-trivial problems. Their code and tests were then assessed separately for correctness and 'coverage', respectively. The results allowed us to directly compare students' abilities to characterise a computational problem, as a unit test suite, and develop a corresponding solution, as executable code. Since understanding a problem is a pre-requisite to solving it, we expected students' unit testing skills to be a strong predictor of their ability to successfully implement the corresponding program. Instead, however, we found that students' testing abilities lag well behind their coding skills.

Keywords: Learning to program; unit testing; object-oriented programming; program specification

1 Introduction

Failure and attrition rates in tertiary programming subjects are notoriously high. Many reasons have been suggested for poor performance on programming assignments, including an inability to fully understand the problem to be solved and an inability to express a solution in the target programming language.

To help determine whether students do poorly on programming tasks due to an incomplete understanding of the problem or an inability to write a solution, we conducted a series of experiments in which these two aspects of programming were evaluated separately. Students in second and third-year programming classes were given assessable assignments which required them to:

Copyright © 2013, Australian Computer Society, Inc. This paper appeared at the Fifteenth Australasian Computing Education Conference (ACE 2013), Adelaide, South Australia, January–February 2013. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 136, Angela Carbone and Jacqueline Whalley, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

1. unambiguously and completely characterise the problem to be solved, expressing the requirements as a unit test suite; and then

2. produce a fully-functional computational solution to the problem, expressed as an object-oriented program.

Both parts of the assignments carried equal weight. By directly comparing how well each student performed on each of these tasks we aimed to see if there was a clear relationship between students' ability to say what must be done versus their ability to say how to do it.

Since understanding a problem is a necessary pre-requisite to solving it, our expectation was that students would inevitably need to do well on problem definition before succeeding in coding a solution. We therefore designed assignments which allowed us to produce independent marks for code functionality and test coverage, enabling students' skills in these two distinct stages of large-scale object-oriented programming to be contrasted.

2 Related and previous work

Our overall goal is to help expose the underlying reasons for students' difficulties in learning how to develop program code. In particular, we aimed to distinguish students' abilities to both describe and solve the same computational problem, using 'test-first' programming as a basis. We did this with classes of students who had already completed two previous programming subjects, so they had advanced beyond the basic problems of developing imperative program code. As a clearly-defined test-first programming paradigm we used 'test-driven development' (Beck 2003), in which unit tests are developed first and the program code is then extended and refactored in order to pass the tests.

The academic role of unit testing in general, and test-first programming in particular, has already received considerable attention. For instance, an early study by Barriocanal et al. (2002) attempted to integrate unit testing into a first-year programming subject. They made unit testing optional to assess its take-up rate and found that only around 10% of students voluntarily adopted the approach (although those who did so reported that they liked it).

In another early study, Edwards (2004) advocated the introduction of testing into student assignments as a way of preventing them from following an ad hoc, trial-and-error coding process. He describes a marking tool called Web-CAT which assesses both program code correctness and unit test coverage. (We do the same thing, but developed our own Unix scripts for this purpose.) However, where we kept students' code correctness and test coverage marks separate, so that we could perform a correlation analysis on them, Edwards (2004) combined the marks to produce a single composite score for return to the students.

In a more recent study of unit testing for first-year students, Whalley & Philpott (2011) assessed the advantages of supplying unit tests to students. (We do likewise in our first-year programming subject, giving students unit tests but not expecting them to write their own.) Using a questionnaire they concluded that although it was generally beneficial for students to apply the 'test-early' principle, a minority of students still struggle to understand the concept.

In an earlier study, Melnick & Maurer (2005) surveyed students' perceptions of agile programming practices, and found that students failed to see the benefits of testing and believed that it requires too much work. In our experience, there is no doubt that test-first programming requires considerable discipline from the programmer and can be more time-consuming than 'test-last' programming. Nevertheless, we have found that this extra effort is compensated for by a better quality product, as have many other academics (Desai et al. 2008).

Given students' reluctance to put effort into unit testing, Spacco & Pugh (2006) argued that testing must be taught throughout the curriculum and that students must be encouraged to "test early". Like us they did this by assessing students' test coverage and by keeping some unit tests used in marking from the students. (In fact, in our experiments we did not provide the students with any unit tests at all. Instead we gave them an Application Programming Interface to satisfy, so that they had to develop all the unit tests themselves.)

Keefe et al. (2006) aimed to introduce not just test-driven development into first-year programming, but also other 'extreme' programming concepts such as pair programming and refactoring. (These principles are also introduced in our overall curriculum, but not all during first year.) Their survey of students found a unanimous dislike of test-driven development, and they concluded that students find it a "difficult" concept to fully appreciate. (In our course we introduce students to unit testing in first year, but don't get them to write their own test suites until the second and third-year subject described herein. At all levels, however, our experience is also that there is considerable resistance to the concept from novice programmers.)

Like many other researchers, Janzen & Saiedian (2007) found that mature programmers were more likely to see the benefits of test-driven development. Given a choice they found that novice programmers picked test-last programming in general. In a subsequent study they also found that students using test-first programming produced more unit tests, and hence better test coverage, than those using test-last programming, but that students nevertheless still needed to be coerced to follow test-first programming (Janzen & Saiedian 2008). (Both of these findings are entirely consistent with our experiences. Although our own academic staff can clearly see the benefits of test-first programming, our students must be coerced into adopting it.)

Most recently, using attitudinal surveys, Buffardi & Edwards (2012) again found that students did not readily accept test-driven development, despite the fact that those students who followed the approach produced higher quality code. (They report that they taught test-driven development but did not assess it directly. By contrast we directly assessed our students' unit testing skills.)

Thus, while there have been many relevant studies, none has presented a direct comparison of unit testing and program coding skills as we do below. Furthermore, while we focussed on students with some programming experience, most previous studies have considered first-year students only.

3 Programming versus unit testing

In traditional "waterfall" programming methodologies, program code is developed first, followed by a set of acceptance tests. Modern "agile" software development approaches reverse this sequence (Schuh 2005). Here the tests are written first, in order to define what the program code is required to do, and then corresponding program code is developed which passes the tests. In the most extreme form, test-driven development involves iteratively writing individual unit tests immediately followed by extending and refactoring the corresponding program code to pass the new test (Beck 2003). This is usually done in object-oriented programming, where the "unit" of testing is one or more methods. Advantages claimed for this approach to software development include the fact that a "working" version of the system is available at all times (even if all the desired functionality has not yet been implemented), that it minimises administrative overheads, and that it responds rapidly to changing customer requirements.

Most importantly for our purposes, a suite of unit tests can be viewed as a specification of what the corresponding program is required to do. It defines, via concrete examples, what output or effect each method in the program must produce in response to specific inputs. For each method in the program, a well-designed unit test suite will include several representative examples of 'typical' cases, which validate the method's 'normal' functionality, as well as 'extreme' or 'boundary' cases, which confirm the method's robustness in unusual situations (Schach 2005, Astels 2003). Overall, a unit test suite must provide good coverage of the various scenarios the individual methods are expected to encounter. (However, unit testing does not consider how separate program modules work together, so it is usually followed by integration testing.)
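
To make the distinction between 'typical' and 'boundary' tests concrete, a minimal sketch follows. The BoundedCounter class and its tests are invented purely for illustration (they are not part of the assignments described later), and JUnit 4 style annotations are assumed.

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // Hypothetical class under test, used only to illustrate the difference
    // between 'typical' and 'boundary' unit tests.
    class BoundedCounter {
        private final int limit;
        private int count = 0;

        BoundedCounter(int limit) {
            if (limit <= 0) throw new IllegalArgumentException("limit must be positive");
            this.limit = limit;
        }

        void increment() {
            if (count >= limit) throw new IllegalStateException("counter is full");
            count++;
        }

        int getCount() { return count; }
    }

    public class BoundedCounterTest {

        // 'Typical' case: normal use well inside the counter's capacity.
        @Test
        public void incrementingTwiceGivesTwo() {
            BoundedCounter counter = new BoundedCounter(10);
            counter.increment();
            counter.increment();
            assertEquals(2, counter.getCount());
        }

        // 'Boundary' case: behaviour exactly at the capacity limit.
        @Test(expected = IllegalStateException.class)
        public void incrementingBeyondLimitIsRejected() {
            BoundedCounter counter = new BoundedCounter(1);
            counter.increment();   // fills the counter
            counter.increment();   // must throw
        }

        // 'Extreme' case: an invalid construction argument must be rejected.
        @Test(expected = IllegalArgumentException.class)
        public void zeroLimitIsRejected() {
            new BoundedCounter(0);
        }
    }

A suite built in this style documents the method's normal behaviour and its robustness at the edges of its input domain, which is precisely the coverage notion assessed below.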

Test-driven software development thus produces two distinct artefacts, a unit test suite and corresponding program code. The first defines what needs to be done and the second describes how this requirement is achieved computationally. Having taught object-oriented programming and test-driven development for several years, we realised that these two artefacts can be assessed separately, giving independent insights into students' programming practices.

4 The experiments

To explore the relationship between students' testing (specification) and coding (programming) ability, we conducted a series of four experiments over two semesters. This was done in the context of a Software Development subject for second and third-year IT students. (The classes also contained a handful of postgraduate students, but too few to have a significant bearing on the results.) The students enrolled have typically completed two previous programming subjects, meaning that they have already advanced beyond simple imperative coding skills and are now concerned with large-scale, object-oriented programming.

The Software Development subject focuses on tools and techniques for large-scale program development and long-term code maintenance. Topics covered include version control, interfacing to databases, software metrics, Application Programming Interfaces, refactoring, automated builds, etc. In particular, the role of unit testing is introduced early and is used throughout the semester. Test-driven development is also introduced early as a consistent methodology for creating unit test suites. The illustrative programming language for the subject is Java, although the concepts of interest are not Java-specific.

Figure 1: User interface for the solution to Assignment 2a

Each semester's assessment includes two non-trivial programming assignments. Both involve developing a unit test suite and a corresponding object-oriented program, each worth approximately equal marks. The first assignment is individual and the second is larger and conducted in pairs. In the first assignment the students are given a front-end Graphical User Interface and must develop the back-end classes needed to support it. In the second assignment the student pairs are required to develop both the GUI and the back-end code.

For example, one of the individual assignments involved developing classes to complete the implementation of a 'Dam Simulator' which models the actions involved in controlling water levels in a dam. (This topical assignment was introduced shortly after the January 2011 Brisbane floods, which were exacerbated by the overflow of the Wivenhoe dam.) The simulator models the effects on water levels of randomly-generated inflows and user-controlled outflows over a period of time. The user acts as the dam's operator and can choose how much water to release into the downstream spillway each day. The simulation ends if the dam overflows or runs dry.

(This assignment, and the 'Warehouse Simulator' described below, are examples of 'optimal stopping' problems, in which the challenge is to optimise the value of a certain variable, such as a dam's water level, despite one or more influencing factors, such as rainfall, being out of the user's control (Ferguson 2010). We find that these problems make excellent assignment topics because they can form the basis of game-like simulations which are popular with the students.)

Figure 2: Example Javadoc specification of a method to be implemented in Assignment 2a

The students were given the code for the user interface (Figure 1), minus the back-end calculations. (As per civil engineering convention, the dam's 'normal' water level is half of its capacity, which is why the meters in Figure 1 go to 200%.) To complete the simulator they were required to develop two classes and their corresponding unit tests. One class is used to keep a daily log of water levels in the dam and the other implements the effects of the user's inputs.

The necessary classes and methods to be implemented were defined via Java 'interface' classes, described in standard 'Javadoc' style. For instance, Figure 2 contains an extract from this specification, showing the requirements for one of the Java methods that must be implemented for the assignment. In this case the method adds a new entry to the log of daily water levels. It must check the validity of its given arguments, add the new entry to the (finite) log, and keep track of the number of log entries made to date. Our anticipated solution is the Java method shown in Figure 3.
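
Since Figure 3 is not reproduced in this text, the following sketch merely suggests the shape of such a method. The class name WaterLog, the method and accessor names, and the exception types are our assumptions, inferred from the description above rather than taken from the assignment's actual API.

    // Hypothetical sketch only: the assignment's real method appears as
    // Figure 3 in the paper and is not reproduced here.
    public class WaterLog {
        private final double[] levels;   // finite log of daily water levels
        private int numEntries = 0;      // number of entries made to date

        public WaterLog(int capacity) {
            levels = new double[capacity];
        }

        /** Adds a new daily water level to the log. */
        public void addEntry(double level) {
            if (level < 0) {                       // validate the argument
                throw new IllegalArgumentException("water level cannot be negative");
            }
            if (numEntries >= levels.length) {     // respect the finite log size
                throw new IllegalStateException("log is full");
            }
            levels[numEntries] = level;            // record the new entry
            numEntries++;                          // track entries made to date
        }

        /** Returns the number of entries made to date. */
        public int getNumEntries() {
            return numEntries;
        }
    }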

Each of these Java functions must be accompanied by a corresponding suite of 'JUnit' tests (Link 2003), to ensure that the method has been implemented correctly and to document its required functionality. For instance, Figure 4 shows one such unit test which confirms that this method correctly maintains the number of log entries made to date. The test does this by instantiating a new log object, adding a fixed number of entries into it, and asserting at each step that the number of entries added to the log so far equals the number of entries reported by the log itself. In general there will be several such unit tests for each method implemented; it is typically the case that a comprehensive unit test suite will be much larger than the program code itself. In our own solution we had seven distinct unit tests like the one in Figure 4 to fully define the required properties of the method in Figure 3.
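
Similarly, Figure 4 is not reproduced here; a test of the kind just described might look roughly as follows. This is again a sketch, reusing the hypothetical WaterLog class outlined above and assuming JUnit 4 annotations, not the assignment's actual test.

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    // Hypothetical sketch of a test in the style of Figure 4.
    public class WaterLogTest {

        @Test
        public void logReportsTheNumberOfEntriesAddedSoFar() {
            final int entriesToAdd = 5;
            WaterLog log = new WaterLog(10);        // instantiate a new log object
            for (int i = 1; i <= entriesToAdd; i++) {
                log.addEntry(50.0);                 // add a fixed number of entries
                // at each step, the count reported by the log must equal the
                // number of entries added so far
                assertEquals(i, log.getNumEntries());
            }
        }
    }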

Figure 3: Example of a Java method to be implemented in Assignment 2a

Figure 4: Example of a JUnit test for the method in Figure 3

Table 1: Statistics for the four assignments

  Assignment                                          1a    1b    2a    2b
  Number of classes to implement (excluding GUIs)      2    10     2     6
  Number of methods to implement (excluding GUIs)     16    25    12    26
  Number of unit tests in our 'ideal' solution        66    63    45    53
  Number of assignments submitted and marked         170    83   217   113

Since the deliverables for these assignments included both unit tests and code, we determined to exploit the opportunity to directly compare how well students performed on unit testing and coding. We considered only those parts of the assignments where both unit tests and code must be produced. (The GUI code in the two pair-programming assignments was not accompanied by unit tests and was marked manually. Nor did we consider marks awarded for 'code quality' in the experiment.) At the time the experiment was conducted we had four assignments' worth of results to analyse, two from each semester (Table 1). In all four cases the students were required to implement both program code and unit tests and it was emphasised that these two parts were worth equal marks.

• Assignment 1a: First semester, individual assignment. This assignment involved completing the back-end code for a 'Warehouse Simulator' which models stock levels in a warehouse. A Graphical User Interface was supplied and students needed to complete a 'ledger' class, to track expenditure, and a 'transactions' class, to implement user-controlled stock buying and randomly-generated customer supply actions.

• Assignment 1b: First semester, paired assignment. This assignment involved developing the GUI and back-end code for a 'Container Ship Management System' which models loading and unloading of cargo containers on the deck of a ship, subject to certain safety and capacity constraints. Classes were needed for different container types (refrigerated, dry goods, hazardous, etc) and for maintaining the ship's manifest. (There were a large number of classes and methods in this assignment, but most were trivial subclasses just containing a constructor and one or two additional methods.)

• Assignment 2a: Second semester, individual assignment. This was the 'Dam Simulator' assignment described above.

• Assignment 2b: Second semester, paired assignment. This assignment involved developing the GUI and back-end code for a 'Departing Train Management System' which modelled the assembly and boarding of a long-distance passenger train. Classes were needed to model the shunting of individual items of rolling stock (including an engine, passenger cars, freight cars, etc) to assemble a train and then simulate boarding of passengers.

In each case the students were expected to follow the test-driven development discipline. For each functional requirement they were expected to develop a JUnit test (such as that in Figure 4) and then extend and refactor their Java program code (like that in Figure 3) until all the system requirements were satisfied. During individual Assignments 1a and 2a the lone student was expected to alternate roles as 'tester' and 'coder'. Pair-programming Assignments 1b and 2b allowed each student to adopt one of these roles at a time.

Unix shell test scripts were developed to automatically mark the submitted assignments. This was done in two stages, to separately assess the students' program code and unit test suites.

• Code functionality: The students' classes were compiled together with our own 'ideal' unit test suite. (As explained below, this often exposed students' failures to match the specified API.) Our unit tests were then executed to determine how well the students had implemented the required functionality in their program code. The proportion of tests passed was used to calculate a 'code functionality' mark and a report was generated automatically for feedback to the students.

• Test coverage: For each of our own unit tests we developed a corresponding 'broken' program which exhibited the flaw being tested for. To assess the students' unit test suite against these programs, their tests were first applied to our own 'ideal' solution program to provide a benchmark for the number of tests passed on a correct solution. Then the students' unit tests were applied to each of our broken programs. If fewer tests were passed than the benchmark our marking script interpreted this to mean that the students' unit tests had detected the bug in the program. (This process is not infallible since it can't tell which of the students' tests failed. Nevertheless, we have found over several semesters that it gives a good, broad assessment of the quality of the students' unit test suites.) The proportion of bugs found was used to calculate a 'test coverage' mark and a feedback report was generated automatically. (A sketch of both scoring rules appears after this list.)
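
The actual marking scripts were written in Unix bash and are not reproduced here; the following Java sketch merely re-expresses the two scoring rules described above. The TestRunner interface and its runTests method are hypothetical stand-ins for compiling a student's JUnit suite against one version of the program and counting how many of their tests pass.

    import java.util.List;

    // Illustrative re-expression of the marking rules described above;
    // not the actual bash marking scripts.
    public class AssignmentMarker {

        /** Stand-in for running a student's test suite against one program version. */
        interface TestRunner {
            int runTests(String programVersion);   // number of the student's tests that pass
        }

        /** Code functionality mark: proportion of our 'ideal' tests passed by the student's code. */
        static double functionalityMark(int idealTestsPassed, int totalIdealTests) {
            return (double) idealTestsPassed / totalIdealTests;
        }

        /** Test coverage mark: proportion of seeded bugs detected by the student's tests. */
        static double coverageMark(TestRunner studentTests, String idealSolution,
                                   List<String> brokenPrograms) {
            // Benchmark: how many of the student's tests pass on a correct solution.
            int benchmark = studentTests.runTests(idealSolution);

            int bugsDetected = 0;
            for (String broken : brokenPrograms) {
                // Fewer passes than the benchmark means the seeded bug was detected.
                if (studentTests.runTests(broken) < benchmark) {
                    bugsDetected++;
                }
            }
            return (double) bugsDetected / brokenPrograms.size();
        }
    }

For example, under this rule a student suite that passes 40 tests on the ideal solution but only 37 on a particular broken program is credited with detecting that program's seeded bug.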

These marking scripts needed to be quite elaborate to cater for the complexity of the assignments and to allow for various problems caused by students failing to follow the assignments' instructions. The marking scripts ultimately consisted of well over 400 lines of (commented) Unix bash code (excluding our own ideal solution and the broken programs needed to assess the students' tests).

A particularly exasperating problem encountered during the marking process was the failure of a large proportion of students to accurately implement the specified Application Programming Interface and file formats, typically due to misspelling the names of classes and methods, failing to throw required exceptions, adding unexpected public attributes, or using the wrong types for numeric parameters. This was despite the teaching staff repeatedly emphasising the need to precisely match the specified API as an important aspect of professional software development. In particular, during marking of Assignment 2a it was found that fully half of the submissions failed to match the API specification, and therefore could not be compiled and assessed. In order to salvage some marks for these defective assignments, those that could be easily corrected by, for example, changing the class and method signatures, were fixed manually. In the end 97 assignments, which accounted for 45% of all submissions, were corrected manually and re-marked. Penalties were applied proportional to the extent of correction needed. Ultimately, however, this large amount of effort produced little difference since the re-marking penalties sometimes outweighed the additional marks gained!

Another practical issue noted by Spacco & Pugh (2006), and confirmed by our own experiences, is the difficulty of developing unit tests that uniquely identify a program bug. In practice a single program bug is likely to cause multiple tests to fail. We found when setting up our automatic marking environment, including our own unit test suite (for assessing the students' code) and the suite of 'broken' programs (for assessing the students' unit tests), that it was impossible to achieve a precise one-to-one correspondence between code bugs and unit tests. While frustrating for us, this did not invalidate the marking process, however.

Table 2: Measures of central tendency, spread and correlation for code functionality and test coverage

  Assignment                                   1a      1b      2a      2b
  Code functionality mean                      82      84      79      91
  Code functionality median                    86      93      86      96
  Code functionality standard deviation        20      20      21      12
  Test coverage mean                           69      79      45      63
  Test coverage median                         67      92      52      69
  Test coverage standard deviation             23      25      29      24
  Functionality versus coverage correlation    0.64    0.76    0.70    0.55

Moreover, one of the risks associated with this kind of study is threats to 'construct validity' (Arisholm & Sjøberg 2004), i.e., the danger that the outcomes are sensitive to different choices of code functionality and test coverage measures. Nevertheless, we believe our analysis is quite robust in this regard because each of the individual tests in our 'ideal' unit test suite was directly developed from a specific functional requirement clearly described in the Javadoc API specifications, and each of the broken programs was introduced to ensure that a particular unit test was exercised. This very close functional relationship between requirements, code features and unit tests left little scope for arbitrary choices of unit tests or broken programs against which to assess the students' assignments.

5 Experimental results

The marks awarded for code functionality and test coverage were normalised to percentages and are summarised in Table 2. (These values exclude marks awarded for 'code presentation' and for the front-end GUIs in Assignments 1b and 2b.) It is obvious that the average marks for test coverage are well below those for code functionality, which immediately casts doubt on our assumption that students' unit testing abilities would be comparable to their programming skills. Furthermore, while a standard correlation measure (last row of Table 2) shows some correlation between the two sets of marks, it is not very strong.
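
For reference, figures like those in the last row of Table 2 are of the kind produced by a standard (Pearson) product-moment correlation over the paired marks. The sketch below is our illustration only, computed over made-up data; it does not reproduce the paper's actual analysis or tooling.

    // Minimal sketch of a standard (Pearson) correlation over paired marks.
    public class Correlation {

        static double pearson(double[] x, double[] y) {
            int n = x.length;
            double meanX = 0, meanY = 0;
            for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
            meanX /= n;
            meanY /= n;

            double cov = 0, varX = 0, varY = 0;
            for (int i = 0; i < n; i++) {
                double dx = x[i] - meanX, dy = y[i] - meanY;
                cov += dx * dy;
                varX += dx * dx;
                varY += dy * dy;
            }
            return cov / Math.sqrt(varX * varY);
        }

        public static void main(String[] args) {
            // Toy (made-up) marks: code functionality vs. test coverage for five students.
            double[] functionality = {86, 93, 79, 96, 60};
            double[] coverage      = {67, 92, 45, 69, 20};
            System.out.println(pearson(functionality, coverage));
        }
    }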

Further insight can be gained by plotting both sets of marks together as in Figures 5 to 8. Here we have sorted the pairs of marks firstly by code functionality, in blue, followed by test coverage, in red. Assignments 1a and 1b used a coarser marking scheme than was used for Assignments 2a and 2b, thereby accounting for their charts' 'chunkier' appearance, but the overall pattern is essentially the same in all four cases. At each level of achievement for code functionality there is a wide range of results for test coverage. This pattern is especially pronounced in Figure 7.


Figure 5: Comparison of marks for code functionality and test coverage, Assignment 1a, 170 submissions

Figure 6: Comparison of marks for code functionality and test coverage, Assignment 1b, 83 submissions

Recall that Assignments 1a and 2a were individual, so we can assume that the same person developed both the unit tests and program code. How pairs of students divided their workload in Assignments 1b and 2b is difficult to say based purely on the submitted artefacts, although if they followed the assignments' instructions they would have swapped 'tester' and 'coder' roles regularly. Figures 6 and 8 are for the pair programming assignments and both show an improvement in the test coverage marks, compared to the preceding individual assignments in Figures 5 and 7. Regardless of how they divided up the task, this suggests that the students took the unit testing parts of the assignment more seriously in their second assessment. Nevertheless, the test coverage results still lag well behind those for code functionality throughout.

Inspection of the submitted assignments suggests that many students didn't follow the test-driven development process which, in practice, can be time consuming and requires considerable discipline. Often students developed their program code and their unit tests separately, rather than letting the latter motivate the former. However, even if students did not follow the test-driven development process, this does not invalidate our comparison of their unit testing and program coding abilities, as these are two distinct skills.

In all four sets of results there are numerous examples of students scoring well for code functionality but very poorly for test coverage, meaning that they could implement a solution to a problem that they couldn't (or merely chose not to) characterise in the form of a unit test suite, the exact opposite of what we would expect if they had strictly followed the test-driven development process.

To explore this phenomenon further we conducted a conventional correlation analysis (Griffiths et al. 1998) to see if there was a clear relationship between students' testing and coding skills, irrespective of their absolute marks for each. We began by checking the normality of all the marks' distributions. This was done by inspecting quantile-quantile plots generated by the qqnorm function from the statistics package R (R Core Team 2012). These plots (not shown) provide good evidence of normality for seven of the eight data sets, albeit with some overrepresentation of high scores in the coding result distributions. The quantile-quantile plots were skewed somewhat by discretisation of the marks; evidence for normality was especially strong for Assignments 2a and 2b, for which we had the most fine-grained marks available.

Figure 7: Comparison of marks for code functionality and test coverage, Assignment 2a, 217 submissions

Figure 8: Comparison of marks for code functionality and test coverage, Assignment 2b, 113 submissions

The marks were then used to create correlation scatterplots (Griffiths et al. 1998), comparing standardised marks for each of the four assignments, as shown in Figures 9 to 12. As usual, dots appearing in the top-right and bottom-left quadrants suggest that there is a positive linear relationship between the variables of interest, in this case students' code functionality and test coverage results. (In the interests of clarity a handful of extreme outliers, resulting from some students receiving near-zero marks for both tests and code, have been omitted from the figures.)

Figure 9: Correlation scatterplot, code functionality (x-axis) versus test coverage (y-axis), with circled dots denoting multiple data points with the same value, Assignment 1a

Figure 10: Correlation scatterplot, Assignment 1b

Figure 11: Correlation scatterplot, Assignment 2a

Figure 12: Correlation scatterplot, Assignment 2b

All four figures exhibit good evidence of a linear relationship, especially due to the numbers of students who did well on both measures (top-right quadrant). Overall, though, the pattern is not as strong as we expected.

Undoubtedly many students put less effort into the unit tests, despite them being worth equal marks to the program code. It is clear from the averages in Table 2 and the charts in Figures 7 and 8 that even the best students did not do as well on the unit tests as the program code, and it certainly wasn't the case that successfully defining the test cases to be passed was a pre-requisite to implementing a solution as we had originally assumed. Even more obvious is the line of dots along the bottom of Figure 11 which was caused by students who received zero marks for test coverage. Evidently these students learned the importance of the unit tests for their overall grade by the time they did their second assignment because no such pattern is evident in Figure 12.

Overall, therefore, contrary to our expectations, computer programming students' performance at expressing what needs to be done proved to be a poor predictor of their ability to define how to do it.

6 Conclusion

The students' poor performance in unit testing compared to program coding surprised us. Nonetheless, the pattern of results in Figures 5 to 8 is remarkably consistent. On the left of each chart are a few examples of students who could characterise the problem to be solved but couldn't complete a solution (i.e., their test coverage results in red are better than their code functionality results in blue), which is what we would expect to see if the students applied test-driven development. However, this was far outweighed in each experiment by the dominance of results in which students produced a good program but achieved poor test coverage (i.e., their blue code functionality results were better than their red test coverage results). We thus have clear empirical evidence that students struggle to a greater extent defining test suites than they do implementing solutions, for the same programming problem.

The outstanding question from this study is whether this unexpected result reflects students' abilities or motivations. From our inspection of the submitted assignments we can make the following observations.

• The results can be explained in part by many students' obvious apathy towards the unit testing part of the assignments. Despite their regular exposure to the principles of test-driven development in class, and despite being well aware that half of their marks for the assignment were for their tests, it was clear in many cases that students did not follow the necessary discipline and instead wrote the program code first, seeing it as more "important". Their unit tests were then completed hastily just before the assignment's deadline. As noted in Section 2 above, this student behaviour has been observed in many prior studies. For instance, Buffardi & Edwards (2012) found that students procrastinate when it comes to unit testing, even when a test-first programming paradigm is advocated, which typically leads to poor test coverage in the submitted assignments.

• Another partial explanation is simply that many students had poor unit testing skills. Test-driven development emphasises the construction of a large number of small and independent tests, each highlighting a distinct requirement. However, inspection of some assignments with low testing scores showed that they consisted of a small number of large and complicated tests, each attempting to do several things at once. This is evidence that the students could not (or chose not to) follow the test-driven development discipline. Small sets of complex unit tests are characteristic of test-last programming and typically produce poor test coverage because several distinct coding errors may all cause the same test to fail, without the reason for the failure being obvious. At the other extreme there were also a few examples of students producing a very large number of tests, often over twice as many as in our own solution, but still achieving poor coverage because their tests did not check distinct problems and so many were redundant. This undirected, 'shotgun' strategy is again evidence of a failure to apply, or understand, test-driven development, which avoids redundancy by only introducing tests that expose new bugs.

• It is also noteworthy that writing unit tests is a cognitively more abstract activity than writing the corresponding program code, thus making it more challenging for students still coming to grips with the fundamentals of programming. Whether or not this was the cause of students' poor test coverage marks is impossible to tell from the submitted artefacts alone. Anecdotally, our discussions with students while they were working on the assignments left us with the impression that they understood the principles of unit testing well enough. By far the most common question asked by students while they were working on their assignment was "How many unit tests should I produce?" rather than "How do I write a unit test?"

Ultimately, therefore, further research is still required. Although we have demonstrated that programming students' testing and coding skills can be analysed separately, and that they consistently perform better at coding than testing, more work is required to conclusively explain why this is so.

Acknowledgements

We wish to thank Dr Andrew Craik for developing the assignment marking scripts used in the first two experiments, and the anonymous reviewers for their many helpful suggestions for improving the correlation analysis. Support for this project was provided by the Office of Learning and Teaching, an initiative of the Australian Government Department of Industry, Innovation, Science, Research and Tertiary Education. The views expressed in this publication do not necessarily reflect the views of the Office of Learning and Teaching or the Australian Government.

References

Arisholm, E. & Sjøberg, D. (2004), 'Evaluating the effect of a delegated versus centralized control style on the maintainability of object-oriented software', IEEE Transactions on Software Engineering 30(8), 521–534.

Astels, D. (2003), Test-Driven Development: A Practical Guide, Prentice-Hall.

Barriocanal, E., Urban, M.-A., Cuevas, I. & Perez, P. (2002), 'An experience in integrating automated unit testing practices in an introductory programming course', SIGCSE Bulletin 34(4), 125–128.

Beck, K. (2003), Test-Driven Development: By Example, Addison-Wesley.

Buffardi, K. & Edwards, S. (2012), Exploring influences on student adherence to test-driven development, in T. Lapidot, J. Gal-Ezer, M. Caspersen & O. Hazzan, eds, 'Proceedings of the Seventeenth Conference on Innovation and Technology in Computer Science Education (ITiCSE'12), Israel, July 3–5', ACM, pp. 105–110.

Desai, C., Janzen, D. & Savage, K. (2008), 'A survey of evidence for test-driven development in academia', SIGCSE Bulletin 40(2), 97–101.

Edwards, S. (2004), Using software testing to move students from trial-and-error to reflection-in-action, in D. Joyce, D. Knox, W. Dann & T. Naps, eds, 'Proceedings of the 35th SIGCSE Technical Symposium on Computer Science Education (SIGCSE'04), USA, March 3–7', ACM, pp. 26–30.

Ferguson, T. S. (2010), 'Optimal stopping and applications'. http://www.math.ucla.edu/~tom/Stopping/Contents.

Griffiths, D., Stirling, W. D. & Weldon, K. L. (1998), Understanding Data: Principles and Practice of Statistics, Wiley.

Janzen, D. & Saiedian, H. (2007), A leveled examination of test-driven development acceptance, in 'Proceedings of the 29th International Conference on Software Engineering (ICSE'07), USA, May 20–26', IEEE Computer Society, pp. 719–722.

Janzen, D. & Saiedian, H. (2008), Test-driven learning in early programming courses, in S. Fitzgerald & M. Guzdial, eds, 'Proceedings of the 39th SIGCSE Technical Symposium on Computer Science Education (SIGCSE'08), USA, March 12–15', ACM, pp. 532–536.

Keefe, K., Sheard, J. & Dick, M. (2006), Adopting XP practices for teaching object oriented programming, in D. Tolhurst & S. Mann, eds, 'Proceedings of the Eighth Australasian Computing Education Conference (ACE2006), Hobart', Vol. 52 of Conferences in Research and Practice in Information Technology, Australian Computer Society, pp. 91–100.

Link, J. (2003), Unit Testing in Java, Morgan Kaufmann.

Melnick, G. & Maurer, F. (2005), A cross-program investigation of students' perceptions of agile methods, in 'Proceedings of the 27th International Conference on Software Engineering (ICSE'05), USA, May 15–21', ACM, pp. 481–488.

R Core Team (2012), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL: http://www.R-project.org

Schach, S. (2005), Object-Oriented and Classical Software Engineering, McGraw-Hill, USA. Sixth edition.

Schuh, P. (2005), Integrating Agile Development in the Real World, Thomson.

Spacco, J. & Pugh, W. (2006), Helping students appreciate Test-Driven Development (TDD), in W. Cook, R. Biddle & R. Gabriel, eds, 'Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'06), USA, October 22–26', ACM, pp. 907–913.

Whalley, J. & Philpott, A. (2011), A unit testing approach to building novice programmers' skills and confidence, in J. Hamer & M. de Raadt, eds, 'Proceedings of the Thirteenth Australasian Computing Education Conference (ACE 2011), Perth', Vol. 114 of Conferences in Research and Practice in Information Technology, Australian Computer Society, pp. 113–118.
