Language Engineering for Human-Computer Collaborative Assessment


Page 1: Language Engineering for  Human-Computer Collaborative  Assessment

Combining the strengths of UMIST and The Victoria University of Manchester

Language Engineering for Human-Computer Collaborative Assessment

Mary McGee Wood

John Sargeant, Phil Reed, Craig Jones

School of Computer Science

Page 2: Language Engineering for  Human-Computer Collaborative  Assessment


The Assess by Computer (ABC) project

• Tools for setting, taking, and marking exams and for admin tasks

• Internally funded by the University of Manchester

• In use for diagnostic, formative, and “high stakes” summative tests, locally and remotely

• HCCA philosophy throughout

• Started as a pragmatic development; gradually turning into a research project.

Page 3: Language Engineering for  Human-Computer Collaborative  Assessment


The problem

• “Every hour of marking is an hour less of life”

• But: we mostly want students’ answers to be constructions, not selections…

• … and accurate autonomous marking of constructed answers (for content) is infeasible.

• And… we also need to improve the quality and accountability of assessment.

Page 4: Language Engineering for  Human-Computer Collaborative  Assessment


Current systems

• Commercially available tools (e.g. QMP) have little or no support for constructed answers (and have other disadvantages)

• Substantial work on Automated Essay Scoring in the USA, especially at the Educational Testing Service (ETS), Princeton

• E-rater – “Essay rater” – concentrates on style and language use.

• C-rater – “Concept rater” – looks at the factual content of answers (85% computer-human agreement, 92% human-human agreement).

Page 5: Language Engineering for  Human-Computer Collaborative  Assessment


HCC – the key idea

• Fully Automatic High Quality Machine Translation (FAHQMT) was never realistic

• FAHQM Anything is probably neither possible nor reasonable

• Aim to exploit the complementary strengths of the system and the user

Page 6: Language Engineering for  Human-Computer Collaborative  Assessment


HCC assessment

• Assessment is a collaborative process where human and program each do what they are good at.

• Answer Representation (AR) grows dynamically during the marking process.

• Aim is to improve both speed and quality of assessment.

Page 7: Language Engineering for  Human-Computer Collaborative  Assessment


HCC software development

• Software development can be a collaborative process where developers and users each do what they are good at.

• System functionality grows dynamically during the use-and-development process.

• Initial aim is to optimise the suitability and habitability of the system.

• Real aim is to improve both speed and quality of carrying out the task in hand.

Page 8: Language Engineering for  Human-Computer Collaborative  Assessment


A marking tool

Page 9: Language Engineering for  Human-Computer Collaborative  Assessment


Answer types

• Multiple choice - useful where appropriate.

• Text – single word to essay. Most common type, can include structured text, e.g. programs, simple maths.

• Slots/fill-in-the-blanks.

• Simple diagrams (experimental).

• Formatted maths – next phase.

Can be used in any combination, structured using composite questions.

Page 10: Language Engineering for  Human-Computer Collaborative  Assessment


Text answer types

• Traditionally: “short answers” vs. “essays”

• Maybe better: “factual” vs “discursive”…

• … or “objective” vs. “subjective”

Hypothesis: Objective answers can usefully be semi-automatically marked using simple statistical clustering and matching techniques, while subjective answers require some amount of “natural language understanding”.

Page 11: Language Engineering for  Human-Computer Collaborative  Assessment


What students really say

• Spelling mistakes

• Word variants

• Context-dependent synonyms

• Original answers

Page 12: Language Engineering for  Human-Computer Collaborative  Assessment


Spelling mistakes

• interpretor, interperetor, interaper (not sure about spelling), …

• hierarchial, hierachical, hirarachical, …

• defieciency, deficency, defiency, defficiency, definciency, dificiency, defciency, defficiency, dfficiency, …

• But: modal / model, casual / causal
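
A minimal illustration of why simple spelling tolerance both helps and hurts. Levenshtein edit distance (a standard measure, not necessarily what ABC uses) pulls the misspelt “deficiency” variants together, but the same tolerance would also conflate genuinely different words such as modal/model and casual/causal:

```python
# Sketch only: classic Levenshtein edit distance. The "deficiency" misspellings
# sit within distance 1-2 of the target word, but so do genuinely different
# word pairs such as modal/model and casual/causal.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

variants = ["defieciency", "deficency", "defiency", "defficiency", "dificiency"]
print([edit_distance(v, "deficiency") for v in variants])                  # 1-2 throughout
print(edit_distance("modal", "model"), edit_distance("casual", "causal"))  # also 1 and 2
```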

Page 13: Language Engineering for  Human-Computer Collaborative  Assessment


Word variants

• “rhesus positive”: 281 students produced 52 forms, with six parameters of variation:

• Upper / lower case: rhesus / Rhesus / RHESUS

• Hyphenation: RH-positive / Rh positive

• Spacing: Rh +ve / RH+ve

• Parentheses: +ve / (+ve)

• "D": Rh positive / Rh D positive

• "positive": positive/ pos./ pos / + / +ve / +ive

Page 14: Language Engineering for  Human-Computer Collaborative  Assessment


Context-dependent synonyms

• Working memory, Rule memory, Inference Engine

• Rule Memmory - which rules are avilable, Main Memory - the current state of the world , Interpretor - decides which rule fires

• The knowledge, the rules that operate on the knowledge, and the Intepreter that links the these two.

Page 15: Language Engineering for  Human-Computer Collaborative  Assessment


Original answers

• “Give an original example of an exception to default inheritance.”

• 9 penguins, 6 ostriches; 20 non-flying birds in total

• 8 non-walking mammals, 30 other anomalous animals, 31 disabled animals

• 5 plants, 28 artefacts

Page 16: Language Engineering for  Human-Computer Collaborative  Assessment


Really original answers

I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish.

In other words i dont know the answer, sorry, hope u can have a good laugh at my expense though!!!! :) p.s. If you havent seen red dwarf then you'll think im odd for the i am a fishj bit, but if you have seen it dont you think its cool!!!

Page 17: Language Engineering for  Human-Computer Collaborative  Assessment


Simple tools can do a lot

• General problem is hard

• Simple tools can save significant marking time compared to paper exams

• Can display all answers to one part-question together

• Order, e.g. by length, highlight keywords (with optional fuzzy matching), etc.

• …and you don’t have to read their handwriting!
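
A sketch of the kind of support meant here (the keyword list and the fuzzy-match threshold are illustrative, not ABC parameters): display all answers to a part-question ordered by length and flag which expected keywords each contains, tolerating near-misses such as “interpretor”:

```python
import difflib

def keyword_hits(answer, keywords, fuzzy=0.8):
    """Which expected keywords appear in an answer, allowing approximate
    token matches (e.g. "interpretor" ~ "interpreter")."""
    tokens = [t.strip(".,;:!?") for t in answer.lower().split()]
    return {kw for kw in keywords
            if any(difflib.SequenceMatcher(None, kw, t).ratio() >= fuzzy for t in tokens)}

answers = ["working memory, rule memory, interpretor",
           "I am a fish. I am a fish. I am a fish.",
           "The knowledge, the rules that operate on the knowledge, and the interpreter"]
keywords = ["working", "rule", "interpreter"]        # marker-supplied, hypothetical
for a in sorted(answers, key=len):                   # display ordered by length
    print(f"{len(keyword_hits(a, keywords))}/{len(keywords)}  {a}")
```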

Page 18: Language Engineering for  Human-Computer Collaborative  Assessment


Clustering

• Each answer (including the model answer) abstracted into a numerical form to enable measurement of similarity with other answers

• Similarity of each answer with each other answer measured and stored in an answer-by-answer similarity matrix

• Clustering algorithm applied to the matrix

Page 19: Language Engineering for  Human-Computer Collaborative  Assessment


Page 20: Language Engineering for  Human-Computer Collaborative  Assessment


Abstraction

• Vector Space Model

• Vectors refined by:

Spelling correction

Stoplist removal

Stemming

Term weighting
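
A minimal sketch of this abstraction in plain Python (not the ABC implementation): a toy stoplist and a crude suffix-stripping stemmer stand in for the real refinements, spelling correction is omitted, and term weighting is shown as inverse document frequency, one common choice:

```python
import math
from collections import Counter

STOPLIST = {"the", "a", "an", "of", "and", "is", "are", "which", "that"}   # toy stoplist

def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer
    (e.g. change / changes / changing all reduce to "chang")."""
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_vector(answer):
    """Bag-of-words term-frequency vector after stoplist removal and stemming."""
    tokens = (w.strip(".,;:!?()").lower() for w in answer.split())
    return Counter(stem(t) for t in tokens if t and t not in STOPLIST)

def idf_weights(vectors):
    """Term weighting: inverse document frequency across the whole answer set."""
    df = Counter(term for v in vectors for term in v)
    return {term: math.log(len(vectors) / df[term]) for term in df}
```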

Page 21: Language Engineering for  Human-Computer Collaborative  Assessment


Page 22: Language Engineering for  Human-Computer Collaborative  Assessment


Similarity measurement

• Cosine distance (standard)

• Stored in an answer-by-answer similarity matrix

• Generic: can handle many other question types, e.g. diagrams
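
Continuing the sketch above, the standard cosine measure (computed here as a similarity rather than a distance) is taken between every pair of weighted vectors and stored as a full answer-by-answer matrix:

```python
def cosine(u, v, w):
    """Cosine similarity of two term-frequency vectors under term weights w."""
    def norm(x):
        return math.sqrt(sum((x[t] * w.get(t, 1.0)) ** 2 for t in x))
    dot = sum(u[t] * v[t] * w.get(t, 1.0) ** 2 for t in u.keys() & v.keys())
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(answers, model_answer):
    """Answer-by-answer similarity matrix; the model answer is the last row/column."""
    vectors = [to_vector(a) for a in answers] + [to_vector(model_answer)]
    w = idf_weights(vectors)
    return [[cosine(u, v, w) for v in vectors] for u in vectors]
```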

Page 23: Language Engineering for  Human-Computer Collaborative  Assessment


Clustering algorithm

• Agglomerative Hierarchical Clustering

• Number of clusters not known in advance

• “Average Within-Cluster Similarity” (AWCS) gives a clue to reliability as a basis for marking
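
A greedy sketch of the idea (not the ABC algorithm itself, and the default threshold is illustrative): keep merging clusters over the similarity matrix for as long as the average within-cluster similarity (AWCS) of the merged cluster stays above a threshold; whatever is left as a singleton is an outlier. Relaxing the threshold, as in Example 3 below, gives fewer outliers and larger clusters.

```python
def cluster(sim, threshold=0.95):
    """Agglomerative clustering over an answer-by-answer similarity matrix.
    Merging continues while the merged cluster's average pairwise similarity
    (AWCS) stays above `threshold`; singletons are returned as outliers."""
    clusters = [[i] for i in range(len(sim))]

    def awcs(members):
        pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
        return sum(sim[a][b] for a, b in pairs) / len(pairs) if pairs else 1.0

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = awcs(clusters[i] + clusters[j])
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return ([c for c in clusters if len(c) > 1],      # clusters proper
            [c[0] for c in clusters if len(c) == 1])  # outliers
```

Put together with the earlier sketches, a call such as cluster(similarity_matrix(answers, model_answer), threshold=0.90) would produce output of the kind shown in the examples that follow, with each cluster then inspected and marked once by the human marker.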

Page 24: Language Engineering for  Human-Computer Collaborative  Assessment


Page 25: Language Engineering for  Human-Computer Collaborative  Assessment


Example 1: Production systems

• “What are the three components of a production system?”

• 151 student answers

Cluster 1 (50): working memory, rule memory, interpreter

Cluster 2 (8): 1. working memory, 2. rule memory, 3. Interpreter

Cluster 3 (6): working memory, rule memory, interpreter (inference engine)

Cluster 4 (5): working memory, rule memory, interpretor

Cluster 5 (3): working memory, rule memory, interpretter

Cluster 6 (3): include the phrase “the three components”

Cluster 7 (3): working memory, rule memory, inference engine

Outliers: 65

Page 26: Language Engineering for  Human-Computer Collaborative  Assessment


Outliers

• Unique mis-spellings:

Rule memory, working memory and interperetor

• Correct answers uniquely expressed:

Working memory - contains state; Rule memory - contains rules; Interpreter - decides which to fire

• Unique wrong answers:

I am a fish. I am a fish. I am a fish.

Page 27: Language Engineering for  Human-Computer Collaborative  Assessment


Example 2: Iron deficiency

• “Name one deficiency which would give rise to a microcytic anaemia.”

• 279 student answers

• 17 clusters, 79 outliers

Cluster 1 (132): iron deficiency and minor variants (collapsed by pre-processing)

Cluster 2 (15): iron

Cluster 3 (8): iron deficiency, diet

Cluster 4 (7): iron deficiency anemia

&c

Page 28: Language Engineering for  Human-Computer Collaborative  Assessment


• “What single measurement would you make to confirm that an individual is anaemic?”

• 22 clusters, 85 outliers

Cluster 1 (67): haemoglobin concentration

Cluster 2 (42): red blood cell count

Cluster 3 (15): packed cell volume

Cluster 4 (13): minor variants on haemoglobin concentration in the blood

&c

Page 29: Language Engineering for  Human-Computer Collaborative  Assessment


Example 3: The frame problem

• “What, in Artificial Intelligence, is the Frame Problem?”

• 104 student answers

• AWCS relaxed to 0.90, giving 12 clusters, 39 outliers

Cluster 1 (19): real world, chang- (change, changes, changing, &c)

Cluster 2 (14): world, chang-

Cluster 3 (8): frame

Cluster 4 (6): exceptions, inheritance

Cluster 5 (4): chang-, repres-

&c

Page 30: Language Engineering for  Human-Computer Collaborative  Assessment


Benefits

• Marking time reduced by factor of 2-3 compared to paper scripts

• Can include MCQs where appropriate - they’re not always bad

• Answers genuinely anonymous

• Consistency likely to improve

• Clerical checking eliminated

• Detailed analysis of results possible – good for “drilling down”.

• Lots of data generated for further research

Page 31: Language Engineering for  Human-Computer Collaborative  Assessment


Collaboration: software development

• Initial system did little more than replace paper exam books

• Gradually extending real (diagnostic, formative, and summative) use

• Gradually extending functionality

• Priorities for development influenced by users and would-be users, e.g. current top priority is formatted maths…

• …and by real student answers.

Page 32: Language Engineering for  Human-Computer Collaborative  Assessment


Conclusions: assessment

In decreasing order of confidence:

• HCCA, even with very simple tools, is very effective, at least in some cases.

• There are many issues of usability, procedures, education…

• Simple keyword-based answers are easy for HCCA but hard for machines alone.

• Discursive / subjective answers probably require a range of NLE techniques.

Page 33: Language Engineering for  Human-Computer Collaborative  Assessment


Conclusions: NLE

• Applications of NLE don’t have to be “all or nothing” …

• … which is just as well, because even “simple” real data is complicated.

• HCC gets the best from both machine and user…

• … and means that very simple techniques can be Really Useful.