Automatically Grading Programming Assignments with Web-CAT

Stephen H. Edwards
Virginia Tech, Dept. of Computer Science
[email protected]
http://web-cat.sourceforge.net/

My goals today are to …

Explain how requiring students to formulate and test hypotheses about their own code can improve their understanding and performance

Describe our experiences with an alternate grading approach supported by a new tool: Web-CAT

Describe some of the flexibility in Web-CAT for supporting other approaches

Convince you software testing can be an important—and practical—addition to classroom practices


Students hold onto ineffective techniques

Too often, intro students believe that if their code:

compiles, the errors are mostly gone

runs correctly when I try it once, it is correct

runs on the instructor-provided sample input, it is correct

has a problem, it can be fixed by trial and error


What is reflection-in-action?

For an expert, when the current technique is failing …

Step back and reflect: “I must be missing something”

Re-examine the situation, your solution, and your implicit assumptions about the problem

Leads to guesses (hypotheses) about why the solution isn’t working or why something else will be better

“[Carry] out an experiment which serves to generate both a new understanding of the phenomenon and a change in the situation”


Practicing software testing will help students frame and carry out experiments

The problem: too much focus on synthesis and analysis too early in teaching CS

Need to be able to read and comprehend source code

Envision how a change in the code will result in a change in the behavior

Need explicit, continually reinforced practice in hypothesizing about program behavior and then experimentally verifying their hypotheses


Student comments suggest their current testing practices are often weak

I run them through some simple tests to ensure that it is operating as expected. But for the most part I have always relied on supplied test data

I don’t think about test cases until I am confident my program is 100% working. Of course, it almost never is …

I usually write the whole thing up and then start doing rapid-fire tests of everything I can think of.



A comprehensive strategy is necessary for a culture shift in what students do

Students cannot test their own code

Want a culture shift in student behavior

A single upper-division course would have little impact on practices in other classes

So: Systematically incorporate testing practices across many courses

[Diagram: Testing Practices spanning CS1, CS2, OO Design, and Data Structures]


Expect students to apply their testing skills all the time in programming assignments

Expect students to test their own work

Empower students by engaging them in the process of assessing their own programs

Require students to demonstrate the correctness of their own work through testing

Do this consistently across many courses


What tools and techniques should I teach?

We want to start with skills that are directly applicable to authentic student-oriented tasks

Don’t want to add bureaucratic busywork to assignments

Without tool support, this is a lost cause!

It is imperative to give students skills they value

… But most textbooks only give a “conceptual” intro to idealized industrial practices, not techniques students can use in their own assignments


Test-driven development is very accessible for students

Also called “test-first coding”

Focuses on thorough unit testing at the level of individual methods/functions

“Write a little test, write a little code”

Tests come first and describe what is expected; the code follows and is revised until all tests pass

Encourages lots of small (even tiny) iterations

See http://web-cat.sf.net/ for on-line references
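
As a minimal sketch of “write a little test, write a little code,” the Java below states the expected behavior of a hypothetical `Counter` class first, then gives just enough code to make those tests pass. Students would normally write such tests with JUnit; plain Java checks are used here only to keep the sketch self-contained, and every name in it is an illustrative assumption.

```java
// Step 1: the tests come first and describe what is expected.
// (Counter and its methods are hypothetical names chosen for illustration.)
class CounterTest {
    static void testNewCounterStartsAtZero() {
        check(new Counter().value() == 0, "new counter should start at 0");
    }

    static void testIncrementAddsOne() {
        Counter c = new Counter();
        c.increment();
        check(c.value() == 1, "after one increment, value should be 1");
    }

    // Tiny stand-in for a JUnit assertion.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        testNewCounterStartsAtZero();
        testIncrementAddsOne();
        System.out.println("all tests pass");
    }
}

// Step 2: just enough code to make the tests above pass.
class Counter {
    private int count = 0;

    void increment() {
        count++;
    }

    int value() {
        return count;
    }
}
```

The next iteration would add one more small test (for example, a reset operation) before adding the code for it.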


Students can apply TDD in assignments and get immediate, useful benefits

Conceptually, easy for students to understand and relate to

Increases confidence in code

Increases understanding of requirements

Preempts “big bang” integration


The problem is devising an effective assessment strategy

Need to assess student performance at testing

Need to give productive feedback

Need to provide rapid turnaround

Cannot afford huge increase in resources required



Conventional automated assessment does not encourage good testing habits

Student uploads program

Program is compiled

Executed against test data

Scored based on output



The conventional approach provides useful benefits that do lead to a cultural change

Fast, precise feedback to students

Chance(s) to improve based on feedback

Good assessment of behavior

Systematic use resulted in culture change


But the conventional approach may discourage desired behavior and skills

Focus is on output correctness, first and foremost

“Get it working first, work on commenting, structure, etc. later”

Students not encouraged or rewarded for testing on their own

Students often do less testing



Proper grading and feedback can provide positive incentive for desirable behavior

Decide what behavior to foster

Choose a corresponding scoring/reward system

Design feedback approach

Use students’ adaptive nature to drive cultural change



Proper grading and feedback is critical to reinforcing desired behavior

Assess test validity: correctness of student’s tests

Assess test completeness: the “thoroughness” of student’s tests

Assess program correctness: behavior of student’s solution

Multiply scores as percentages
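
The multiplication rule can be sketched in a few lines of Java. This is an illustration under assumptions, not Web-CAT's actual scoring code: the class and method names are hypothetical, and the example assumes that completeness is measured as a coverage fraction and correctness as the fraction of reference tests passed.

```java
// Hypothetical helper illustrating the "multiply scores as percentages" rule.
class TestingScore {
    /**
     * Each argument is a fraction between 0.0 and 1.0.
     * Returns the combined score as a percentage (0 to 100).
     */
    static double combined(double validity, double completeness, double correctness) {
        return 100.0 * validity * completeness * correctness;
    }

    public static void main(String[] args) {
        // Assumed scenario: every student test passes (validity 1.0), the
        // tests cover 80% of the code, and the solution passes 90% of the
        // instructor's reference tests.
        System.out.printf("%.1f%%%n", combined(1.0, 0.80, 0.90));
    }
}
```

In the assumed scenario the combined score works out to 72%. Because the three measures are multiplied rather than averaged, a zero in any one of them (for instance, submitting no tests at all) drives the whole score to zero.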


Students improve their code quality when using Web-CAT

[Chart: Bugs/KSLOC (axis 0 to 125) for students ranked by defect rate (0% to 100%), with reference levels marked for newly written “untested” code and commercial-quality code]


Students start earlier and finish earlier when they use Web-CAT

[Chart: “Submissions Relative to Due Date.” Number of submissions (axis 0 to 20) by days before the due date (More, +9 through +1, Due, -1, -2), comparing With Testing and Without Testing]


An evaluation of submitted code indicates students program more effectively

Measure                           Without TDD   With TDD
Recorded grades                   90.2%         96.1%
TA assessment                     98.1%         98.2%
Automated grader assessment       76.8%         94.0%
Faults on master test suite       36.7%         24.9%
Projected defects/KSLOC           70            38 (45% less!)
How early was first submission?   2.2 days      4.2 days

Bold: p = .05 significance


After using TDD and Web-CAT, students clearly perceive practical benefits

Mean student ratings (scale from Disagree to Agree):

More helpful at detecting errors than Curator: 4.3

Provides excellent support for TDD: 4.1

Increases my confidence in correctness: 3.9

Increases my confidence when making changes: 3.8

Makes me test my solution more thoroughly: 3.8

Makes me more systematic in devising tests: 3.8

Would like to use, even if not required: 3.8


Student reactions are very positive toward TDD

I am very excited about using TDD.

I agree that TDD can be beneficial and I’m glad we are being required to experiment with it in this course.

If it increases the effectiveness of my programming and decreases the time I spend debugging, then I am all for it.

[Previously,] I had to quit my detailed testing and stick to making the program appear to work with the sample data given every time a deadline drew near. With [TDD], the tests are such an integral part of the project that no time-conserving measure will save me.



We use Web-CAT to automatically process student submissions and check their work

Web application written in 100% pure Java

Deployed as a servlet

Built on Apple’s WebObjects

Uses a large-grained plug-in architecture internally, providing for easily extensible data model, UI, and processing features


Web-CAT’s strengths are targeted at broader use

Security: mini-plug-ins for different authentication schemes, global user permissions, and per-course role-based permissions

Portability: 100% pure Java servlet for Web-CAT engine

Extensibility: Completely language-neutral, process-agnostic approach to grading, via site-wide or instructor-specific grading plug-ins

Manual grading: HTML “web printouts” of student submissions can be directly marked up by course staff to provide feedback


Grading plug-ins are the key to process flexibility and extensibility in Web-CAT

Processing for an assignment consists of a “tool chain” or pipeline of one or more grading plug-ins

The instructor has complete control over which plug-ins appear in the pipeline, in what order, and with what parameters

A simple and flexible, yet powerful way for plug-ins to communicate with Web-CAT and with each other

We have a number of existing plug-ins for Java, C++, Scheme, Prolog, Pascal, Standard ML, …

Instructors can write and upload their own plug-ins

Plug-ins can be written in any language executable on the server (we usually use Perl)
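
The pipeline idea can be modeled in a few lines. The Java below is a sketch only and is not Web-CAT's plug-in API: real plug-ins are separate programs (usually Perl, as noted above), and the `GradingStep` interface, the shared map, and the key names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a grading "tool chain"; not Web-CAT's real API.
class GradingPipeline {
    private final List<GradingStep> steps = new ArrayList<>();

    // The instructor chooses which steps appear and in what order.
    GradingPipeline add(GradingStep step) {
        steps.add(step);
        return this;
    }

    // Runs every step in order, passing along one shared set of properties.
    Map<String, String> grade(String submission) {
        Map<String, String> shared = new LinkedHashMap<>();
        shared.put("submission", submission);
        for (GradingStep step : steps) {
            step.run(shared);
        }
        return shared;
    }

    public static void main(String[] args) {
        GradingPipeline pipeline = new GradingPipeline()
            .add(shared -> shared.put("compiled", "true"))
            .add(shared -> {
                // A later step can depend on what an earlier step reported.
                boolean ok = "true".equals(shared.get("compiled"));
                shared.put("score", ok ? "100" : "0");
            });
        System.out.println(pipeline.grade("Student.java"));
    }
}

// Each step reads results left by earlier steps and adds its own.
interface GradingStep {
    void run(Map<String, String> shared);
}
```

Sharing a simple set of key/value properties is one plausible way to let independently written steps cooperate without knowing about each other's internals.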


The most well-known plug-in is for grading Java assignments that include student tests

ANT-based build of arbitrary Java projects

PMD and Checkstyle static analysis

ANT-based execution of student-written JUnit tests

Carefully designed Java security policy

Clover test coverage instrumentation

ANT-based execution of optional instructor reference tests

Unified HTML web printout

Highly configurable (PMD rules, Checkstyle rules, supplemental jar files, supplemental data files, java security policy, point deductions, and lots more)


Web-CAT supports a variety of languages, and its Java plug-in is aimed at software testing



Web-CAT provides timely, constructive feedback on how to improve performance

Indicates where code can be improved

Indicates which parts were not tested well enough

Provides as many “revise/resubmit” cycles as possible


The most important step in writing testable assignments is …

Learning to write tests yourself

Writing an instructor’s solution with tests that thoroughly cover all the expected behavior

Practice what you are teaching/preaching


Students get frustrated without feedback, so reference tests must provide some

If students only get a score, but no other feedback for how to improve, they get easily frustrated

We augment our reference tests to provide “hints” for failed tests, cross-referenced to the program assignment

Requirements in assignment spec

mul: this command takes two arguments from the evaluation stack and multiplies them (11).

Feedback to student on failed test

Your testing does not fully cover (11)

More detailed alternate feedback

(11) mul command failed, expected 4 but received 8
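
One way to attach such hints is to have each reference check return a message tagged with the requirement number from the assignment spec. The Java below is a sketch under assumptions: `StackEvaluator` is a hypothetical stand-in for a student's solution, and `checkMul` is an illustrative helper rather than part of Web-CAT or JUnit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ReferenceTests {
    // Returns null on success, or a hint tagged with the requirement number.
    static String checkMul(StackEvaluator evaluator) {
        evaluator.push(2);
        evaluator.push(4);
        evaluator.mul();
        int expected = 8;
        int actual = evaluator.top();
        if (actual == expected) {
            return null;
        }
        return "(11) mul command failed, expected " + expected
            + " but received " + actual;
    }

    public static void main(String[] args) {
        String hint = checkMul(new StackEvaluator());
        System.out.println(hint == null ? "mul passed" : hint);
    }
}

// Hypothetical stand-in for a student's stack-based evaluator.
class StackEvaluator {
    private final Deque<Integer> stack = new ArrayDeque<>();

    void push(int value) {
        stack.push(value);
    }

    // Requirement (11): take two arguments from the stack and multiply them.
    void mul() {
        stack.push(stack.pop() * stack.pop());
    }

    int top() {
        return stack.peek();
    }
}
```

Whether the student sees only the requirement number or the full expected/received message is then a policy choice about how much feedback to reveal.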


Students will try to get Web-CAT to do their work for them

Students appreciate the feedback, but will avoid thinking at (nearly) all costs

Too much feedback encourages students to use Web-CAT for testing instead of writing their own tests—they use it as a development tool instead of simply to check their work

This limits the learning benefits, which come in large part from students writing their own tests

Lesson: balance providing suggestive feedback without “giving away” the answers: lead the student to think about the problem


We have also tried to influence student work habits to improve their success

Encourage early submission by providing extra incentives or using late penalties

Score bonuses and/or penalties are easy

Another useful approach:

Generous limit on the total number of submissions (60)

Hints disappear one day before the due date

Project closes for one day to encourage students to step away and reflect on “the last bug”

Project opens again for one day with hints re-enabled, but with a cap on how much the score can improve


Lessons for writing program assignments intended for automatic grading

Requires greater clarity and specificity

Requires you to explicitly decide what you wish to test, and what you wish to leave open to student interpretation

Requires you to unambiguously specify the behaviors you intend to test

Requires preparing a reference solution before the project is due, which means more upfront work for professors or TAs

Grading is much easier because many things are taken care of by Web-CAT; course staff can focus on assessing design


Areas to look out for in writing “testable” assignments

How do you write tests for the following:

Main programs

Code that reads from or writes to stdin/stdout or files

Code with graphical output

Code with a graphical user interface


Testing main programs

The key: think in object-oriented terms

There should be a principal class that does all the work, and a really short main program

The problem is then simply how to test the principal class (i.e., test all of its methods)

Make sure you specify your assignments so that such principal classes provide enough accessors to inspect or extract what you need to test
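
A minimal sketch of this structure, assuming a hypothetical word-counting assignment (the class and method names are illustrative, not taken from any actual assignment):

```java
// The main program stays short enough that it needs little testing itself.
class WordCountMain {
    public static void main(String[] args) {
        WordCounter counter = new WordCounter();
        for (String arg : args) {
            counter.process(arg);
        }
        System.out.println(counter.getWordCount());
    }
}

// Hypothetical principal class: all the real work lives here.
class WordCounter {
    private int words = 0;

    void process(String line) {
        String trimmed = line.trim();
        if (!trimmed.isEmpty()) {
            words += trimmed.split("\\s+").length;
        }
    }

    // Accessor that lets a test inspect the result.
    int getWordCount() {
        return words;
    }
}
```

A test can now create a `WordCounter`, call `process` with chosen inputs, and check `getWordCount()` without ever running `main`.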


Testing input and output behavior

The key: specify assignments so that input and output use streams given as parameters, and are not hard-coded to specific sources or destinations

Then use string-based streams to write test cases; show students how

In Java, we use BufferedReaders and PrintWriters for all I/O

In C++, we use istreams and ostreams for all I/O
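
For the Java case, a string-based test might look like the sketch below. `EchoUpper` is a hypothetical assignment invented for illustration; the stream classes (`BufferedReader`, `PrintWriter`, `StringReader`, `StringWriter`) are from the standard `java.io` package.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class EchoUpperDemo {
    public static void main(String[] args) {
        // A test supplies string-based streams instead of System.in/System.out.
        BufferedReader in = new BufferedReader(new StringReader("hello\nworld\n"));
        StringWriter captured = new StringWriter();
        PrintWriter out = new PrintWriter(captured);

        EchoUpper.run(in, out);
        out.flush();

        // The captured text can now be compared against the expected output.
        System.out.print(captured);
    }
}

// Hypothetical assignment code: the streams are parameters, not hard-coded.
class EchoUpper {
    static void run(BufferedReader in, PrintWriter out) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                out.print(line.toUpperCase());
                out.print('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The real main program would pass in readers and writers wrapped around `System.in` and `System.out`, so the same `run` method serves both production use and testing.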


Testing programs with graphical output

The key: if graphics are only for output, you can ignore them in testing

Ensure there are enough methods to extract the key data in test cases

We use this approach for testing Karel the Robot programs, which use graphic animation so students can observe behavior


Testing programs with graphical UIs

This is a harder problem—maybe too distracting for many students, depending on their level

The key question: what is the goal in writing the tests? Is it the GUI you want to test, some internal behavior, or both?

Three basic approaches:

Specify a well-defined boundary between the GUI and the core, and only test the core code

Switch in an alternative implementation of the UI classes during testing

Test by simulating GUI events
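
The first two approaches can be combined in a small sketch: the core talks to the UI only through an interface, and tests switch in a recording implementation. All names below are hypothetical, and a real assignment would have a second `Display` implementation backed by actual GUI widgets.

```java
import java.util.ArrayList;
import java.util.List;

class TemperatureDemo {
    public static void main(String[] args) {
        RecordingDisplay display = new RecordingDisplay();
        TemperatureConverter core = new TemperatureConverter(display);
        core.convert(100.0);
        System.out.println(display.shown);
    }
}

// Boundary between the core and the GUI (hypothetical names throughout).
interface Display {
    void show(String text);
}

// Core logic: knows nothing about windows or widgets.
class TemperatureConverter {
    private final Display display;

    TemperatureConverter(Display display) {
        this.display = display;
    }

    void convert(double celsius) {
        double fahrenheit = celsius * 9.0 / 5.0 + 32.0;
        display.show(fahrenheit + " F");
    }
}

// Test-time replacement for the real GUI class: it just records the output.
class RecordingDisplay implements Display {
    final List<String> shown = new ArrayList<>();

    @Override
    public void show(String text) {
        shown.add(text);
    }
}
```

The third approach, simulating GUI events, needs a GUI testing library and is not shown here.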


Conclusion: including software testing helps promote learning and performance

If you require students to write their own tests …

Our experience indicates students are more likely to complete assignments on time, produce one-third fewer bugs, and achieve higher grades on assignments

It is definitely more work for the instructor

But it definitely improves the quality of programming assignment writeups and student submissions


Visit our SourceForge project!

http://web-cat.sourceforge.net/

Info about using our automated grader, getting trial accounts, etc.

Movies of making submissions, setting up assignments, and more

Custom Eclipse plug-ins for C++-style TDD

Links to our own Eclipse feature site
