Portugal

Improving the Automatic Evaluation of Problem Solutions in Programming Contests

Pedro Ribeiro and Pedro Guerreiro
Plovdiv – IOI’2009, P. Ribeiro, P. Guerreiro
Presentation Overview

• Automatic Evaluation: Past and Present
  – The case of the IOI
• A possible path for improving evaluation
  – Developing only a function (not a complete program)
  – Abstract Input/Output
  – Repeating the same function call (+ clock precision)
  – No hints on expected complexity
  – Examining runtime behaviour as tests increase in size
• Some preliminary results
• Conclusions
Programming Contests: (Automatic) Evaluation

• All programming contests need an efficient and fair way of distinguishing submitted solutions
• What do we evaluate?
  – Correctness: does the program produce correct answers for all instances of the problem?
  – Efficiency: does it do so fast enough? Does it have the necessary time and memory complexity?
Programming Contests

• Classic way of evaluating:
  – Set of pre-defined tests (inputs)
  – Run the program with the tests and check its output
• The IOI has been doing this in almost the same way since the beginning, with two major advances:
  – Manual evaluation -> Automatic evaluation
  – Individual tests -> Grouped tests
• Although the IOI has 3 different types of tasks, the core of the event is still batch tasks
IOI Types of Tasks

IOI Year   Batch Tasks   Reactive Tasks   Output Only
2009           7              1               0
2008           6              0               0
2007           5              1               0
2006           4              0               2
2005           5              1               0
2004           5              0               1
Programming Contests

• Correctness: almost a “black art”
  – “Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence” (Dijkstra)
• Efficiency:
  – Typically the judges create a set of model solutions of different complexities
  – Tests are designed so that the model solutions achieve the planned number of points
  – Considerable amount of tuning (environment)
  – Considerable amount of manpower needed
  – More difficult to introduce new languages
Ideas: Single Function

• Solve the problem by writing a specific function (as opposed to a complete program)
• Motivation:
  – Concentrate on the core algorithm (fewer distractions)
  – Can be used at earlier stages of learning
  – Opportunities for new ways of testing (more control over the submitted code)
• It is already done in other types of contests:
  – TopCoder
  – Teaching environments (Ribeiro and Guerreiro, 2008)
Ideas: I/O Abstraction

• The input and output should be “abstract” and not specific to a language
• How to do it:
  – Input already in memory, passed as function arguments (simple form, no complex data structures)
  – Output as the function return value(s)
• Motivation:
  – Fewer information-processing details
  – Less complicated problem statements
  – We can measure the time spent in the solution (not in I/O)
  – More balanced performance between languages
Idea: Repeat Function Calls

• In the past, we used smaller input sizes
• With the increased speed of computers, we currently use huge input sizes
  – Clock resolution is poor: small instances finish in an instant
  – We need to distinguish small asymptotic complexities
  – Historic fact: the smallest time limit used at an IOI:
    • IOI 2007, problem “training”: 0.3 seconds
• Future?
  – Ever more speed -> ever bigger input sizes
Idea: Repeat Function Calls

• Problems become completely detached from reality:
  – Ex: IOI 2007 “Sails”: a ship with 100,000 masts
Idea: Repeat Function Calls

• Real-world analogy: how can we measure the thickness of a sheet of paper with a standard ruler that lacks the necessary accuracy?
  – If a stack of 100 sheets measures 1 cm, then each sheet is ~0.1 mm
• We can use the same idea on functions!
  – Running once on a small instance may appear instantaneous,
  but
  – running it multiple times takes more than 0.00s!
Idea: Repeat Function Calls

• Run the same function several times and compute the average time
• Pros:
  – Input sizes can be smaller and related to the problem
  – We can concentrate on the quality of the test cases and rely less on randomization to produce big test cases that are impossible to verify manually
• Cons:
  – We must be careful with memory persistence between successive function calls
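A minimal sketch of this repeated-call timing, in the spirit of the stacked-sheets analogy (the helper name and time budget are illustrative):

```python
import time

def average_runtime(func, arg, min_total=1.0):
    """Call func(arg) repeatedly until the aggregate wall-clock time
    exceeds min_total seconds, then return the average time per call.
    Like stacking sheets of paper, this makes calls that are individually
    below the clock's resolution measurable in aggregate."""
    calls, total = 0, 0.0
    while total < min_total:
        start = time.perf_counter()
        func(arg)
        total += time.perf_counter() - start
        calls += 1
    return total / calls

# Example: time a simple linear-cost function on a small input.
avg = average_runtime(lambda n: sum(range(n)), 10_000, min_total=0.2)
print(f"average time per call: {avg:.2e} s")
```

Note the slide’s caveat: if the submitted function caches results or keeps state between calls, later calls may be artificially fast, so a real grader would have to guard against memory persistence.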
Idea: No Hints on Complexity

• When we give limits for the input:
  – we simplify implementation details and avoid the need for dynamic memory allocation,
  but
  – we disclose the complexity required for the problem
• Trained students can identify precisely the complexity needed
• This has a great impact on the problem-solving aspect:
  – Different mindset: “I know which complexity I’m looking for, and I settle for a solution that achieves it”
  vs
  – A scientific approach, as with real-world open problems
    • Ex: is there a polynomial solution for this problem?
Idea: No Hints on Complexity

• Give limits for implementation purposes, but make it clear that they are not related to the efficiency sought
• A more scientific and open-ended approach
• Contestants need to think about how to really solve the problem (and not about how to produce a program that passes the test cases)
• Does not overemphasize the runtime of a particular language
  – (“let me run a test with the maximum limits and see if it finishes in X seconds on this machine with this language”)
Idea: Runtime Behaviour as Tests Increase

• Typically we measure efficiency by creating a set of tests such that different model solutions achieve different numbers of points,
but
• not passing a test does not imply that the required complexity was not achieved (other factors intervene)
  – Passing just means that the test case is solved within the constraints
• A lot of manpower is needed for the model solutions and the fine-tuning (compiler version, computer speed, language used, etc.)
Idea: Runtime Behaviour as Tests Increase

• How can we improve on that?
• Pen and paper are not an option for large-scale evaluation
  – We need automatic processes
• We have different tests and different time measurements: why don’t we use all this information?
• Plot the runtime as the data size increases and do some curve fitting
  – It is impossible to determine the complexity of every program, but even a trivial (imperfect) fit can show more information than just knowing which test cases are passed
Some Preliminary Results

• As a proof of concept, a simple problem:
  – Input: a sequence of integers, e.g. 1 -2 3 10 -4 3 -6 4 -1 1
  – Output: the subsequence of consecutive integers with maximum sum
• Only ask for the function, with the I/O already given
• Small input limit (only 100)
• Measure time by running multiple times (until the aggregated time reaches 1s)
• Use random data for N = 1, 4, 8, 12, …, 64
Some Preliminary Results

• Implemented 3 model solutions:
  – A: O(N^3) – iterate over all possible intervals in O(N^2), then iterate through each interval to compute its sum in O(N)
  – B: O(N^2) – iterate over all possible intervals in O(N^2), with O(1) checking of each sum using accumulated sums
  – C: O(N) – iterate through the sequence keeping a partial sum; whenever the partial sum becomes negative, it cannot contribute to the best interval, so “reset” it to zero and continue
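The O(N) model solution (C) described above can be sketched as follows; this is a reconstruction from the slide’s description (the function name is illustrative), not the authors’ actual code:

```python
def max_consecutive_sum(seq):
    """O(N) solution: keep a running partial sum and reset it to zero
    whenever it becomes negative, since a negative prefix can never
    contribute to the best interval."""
    best = seq[0]          # assumes at least one element
    partial = 0
    for v in seq:
        partial += v
        if partial > best:
            best = partial
        if partial < 0:
            partial = 0    # a negative partial sum cannot help later
    return best

# The example sequence from the problem statement:
print(max_consecutive_sum([1, -2, 3, 10, -4, 3, -6, 4, -1, 1]))  # prints 13
```

Initializing best to the first element keeps the function correct even when every number in the sequence is negative.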
Some Preliminary Results

• Plot Time(N) / Time(1)
• Simple correlation measure with candidate functions (for each solution, the highest correlation matches its actual complexity):

Solution   log N    N        N log N   N^2      N^3      N^4      2^N      N!
A          0.6848   0.9264   0.9515    0.9912   0.9993   0.9869   0.6033   0.5722
B          0.7524   0.9666   0.9835    0.9998   0.9848   0.9564   0.5469   0.5183
C          0.8624   0.9952   0.9927    0.9586   0.8985   0.8417   0.4136   0.3906
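The correlation measure behind such a table can be sketched as follows (a plain Pearson correlation between the measured times and candidate growth functions; the helper names are illustrative):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A subset of the candidate growth functions from the table.
CANDIDATES = {
    "log N": lambda n: math.log(n),
    "N": lambda n: float(n),
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: float(n ** 2),
    "N^3": lambda n: float(n ** 3),
}

def best_fit(sizes, times):
    """Correlate the measured times against each candidate growth
    function and return the name of the best-matching one."""
    scores = {name: pearson([f(n) for n in sizes], times)
              for name, f in CANDIDATES.items()}
    return max(scores, key=scores.get)
```

For a quadratic-time submission measured at several sizes, best_fit would pick N^2; as the slides note, this is a heuristic indication of runtime behaviour, not a proof of complexity.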
Some Preliminary Results

• It is out of scope here to give a more detailed mathematical analysis
  – We could use other statistical measures
• We know that it is impossible to automatically compute and prove complexities,
but
• this simple approach gives meaningful results:
  – The runtime is consistent with and correlated to a certain function, and therefore appears to grow following a pattern that we were able to identify
  – Ex: linear -> appears to take twice the time when the data doubles
Some Preliminary Results

• What could this give us?
  – More information from the same test cases
  – The possibility of giving students automatic feedback on runtime behaviour
  – The possibility of identifying runtime behaviours for which no model solutions were created (less manpower!)
  – Independence from language-specific details

• Ex: problem “Archery”, IOI 2009, Day 1
  – There were solutions with O(N^2 R), O(N^3), O(N^2 log N), O(N^2), O(N log N), …
  – No need to code them all in all languages and then tune!
Conclusion

• 20 years of IOI: computers are much faster, but the style of evaluation is still the same
• Setting up test cases is time-consuming and requires manpower
• We need to think of ways to improve evaluation
• Our proposal, geared towards more informal contests or teaching environments, can offer:
  – No distraction with I/O
  – No large data sets
  – More natural problem statements
  – No hints on complexity (an open-ended approach)
  – No need to implement many model solutions
  – New languages can be added without changing the tests
• More work is still needed to obtain a robust system, but we feel that these ideas (or some of them) can be used in practice
• Future: can evaluation be improved in other ways?
The End
• And that’s all! :-)
Questions?
Pedro Ribeiro ([email protected])
Pedro Guerreiro ([email protected])