Language Engineering for Human-Computer Collaborative Assessment


Page 1: Language Engineering for  Human-Computer Collaborative  Assessment

Combining the strengths of UMIST and The Victoria University of Manchester

Language Engineering for Human-Computer Collaborative Assessment

Mary McGee Wood

John Sargeant, Phil Reed, Craig Jones

School of Computer Science

Page 2: Language Engineering for  Human-Computer Collaborative  Assessment


The Assess by Computer (ABC) project

• Tools for setting, taking, and marking exams and for admin tasks

• Internally funded by the University of Manchester

• In use for diagnostic, formative, and “high stakes” summative tests, locally and remotely

• HCCA philosophy throughout

• Started as a pragmatic development; gradually turning into a research project.

Page 3: Language Engineering for  Human-Computer Collaborative  Assessment


The problem

• “Every hour of marking is an hour less of life”

• But: we mostly want students’ answers to be constructions, not selections…

• … and accurate autonomous marking of constructed answers (for content) is infeasible.

• And… we also need to improve the quality and accountability of assessment.

Page 4: Language Engineering for  Human-Computer Collaborative  Assessment


Current systems

• Commercially available tools (e.g. QMP) have little or no support for constructed answers (and have other disadvantages)

• Substantial work on Automated Essay Scoring in the USA, especially at the Educational Testing Service (ETS), Princeton

• E-rater – “Essay rater” – concentrates on style and language use.

• C-rater – “Concept rater” – looks at the factual content of answers (85% computer-human agreement, 92% human-human agreement).

Page 5: Language Engineering for  Human-Computer Collaborative  Assessment


HCC – the key idea

• Fully Automatic High Quality Machine Translation (FAHQMT) was never realistic

• FAHQM Anything is probably neither possible nor reasonable

• Aim to exploit the complementary strengths of the system and the user

Page 6: Language Engineering for  Human-Computer Collaborative  Assessment


HCC assessment

• Assessment is a collaborative process where human and program each do what they are good at.

• Answer Representation (AR) grows dynamically during the marking process.

• Aim is to improve both speed and quality of assessment.

Page 7: Language Engineering for  Human-Computer Collaborative  Assessment


HCC software development

• Software development can be a collaborative process where developers and users each do what they are good at.

• System functionality grows dynamically during the use-and-development process.

• Initial aim is to optimise the suitability and habitability of the system.

• Real aim is to improve both speed and quality of carrying out the task in hand.

Page 8: Language Engineering for  Human-Computer Collaborative  Assessment


A marking tool

Page 9: Language Engineering for  Human-Computer Collaborative  Assessment


Answer types

• Multiple choice - useful where appropriate.

• Text – single word to essay. Most common type, can include structured text, e.g. programs, simple maths.

• Slots/fill-in-the-blanks.

• Simple diagrams (experimental).

• Formatted maths – next phase.

Can be used in any combination, structured using composite questions.

Page 10: Language Engineering for  Human-Computer Collaborative  Assessment


Text answer types

• Traditionally: “short answers” vs. “essays”

• Maybe better: “factual” vs “discursive”…

• … or “objective” vs. “subjective”

Hypothesis: Objective answers can usefully be semi-automatically marked using simple statistical clustering and matching techniques, while subjective answers require some amount of “natural language understanding”.

Page 11: Language Engineering for  Human-Computer Collaborative  Assessment


What students really say

• Spelling mistakes

• Word variants

• Context-dependent synonyms

• Original answers

Page 12: Language Engineering for  Human-Computer Collaborative  Assessment


Spelling mistakes

• interpretor, interperetor, interaper (not sure about spelling), …

• hierarchial, hierachical, hirarachical, …

• defieciency, deficency, defiency, defficiency, definciency, dificiency, defciency, defficiency, dfficiency, …

• But: modal / model, casual / causal
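
A minimal illustration of why simple spelling tolerance both helps and hurts. Levenshtein edit distance (a standard measure, not necessarily what ABC uses) pulls the misspelt “deficiency” variants together, but the same tolerance would also conflate genuinely different words such as modal/model and casual/causal:

```python
# Sketch only: classic Levenshtein edit distance. The "deficiency" misspellings
# sit within distance 1-2 of the target word, but so do genuinely different
# word pairs such as modal/model and casual/causal.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

variants = ["defieciency", "deficency", "defiency", "defficiency", "dificiency"]
print([edit_distance(v, "deficiency") for v in variants])                  # 1-2 throughout
print(edit_distance("modal", "model"), edit_distance("casual", "causal"))  # also 1 and 2
```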

Page 13: Language Engineering for  Human-Computer Collaborative  Assessment


Word variants

• “rhesus positive”: 281 students produced 52 forms, with six parameters of variation:

• Upper / lower case: rhesus / Rhesus / RHESUS

• Hyphenation: RH-positive / Rh positive

• Spacing: Rh +ve / RH+ve

• Parentheses: +ve / (+ve)

• "D": Rh positive / Rh D positive

• "positive": positive/ pos./ pos / + / +ve / +ive

Page 14: Language Engineering for  Human-Computer Collaborative  Assessment


Context-dependent synonyms

• Working memory, Rule memory, Inference Engine

• Rule Memmory - which rules are avilable, Main Memory - the current state of the world , Interpretor - decides which rule fires

• The knowledge, the rules that operate on the knowledge, and the Intepreter that links the these two.

Page 15: Language Engineering for  Human-Computer Collaborative  Assessment


Original answers

• “Give an original example of an exception to default inheritance.”

• 9 penguins, 6 ostriches; 20 non-flying birds in total

• 8 non-walking mammals, 30 other anomalous animals, 31 disabled animals

• 5 plants, 28 artefacts

Page 16: Language Engineering for  Human-Computer Collaborative  Assessment


Really original answers

I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish.

In other words i dont know the answer, sorry, hope u can have a good laugh at my expense though!!!! :) p.s. If you havent seen red dwarf then you'll think im odd for the i am a fishj bit, but if you have seen it dont you think its cool!!!

Page 17: Language Engineering for  Human-Computer Collaborative  Assessment


Simple tools can do a lot

• General problem is hard

• Simple tools can save significant marking time compared to paper exams

• Can display all answers to one part-question together

• Order, e.g. by length, highlight keywords (with optional fuzzy matching), etc.

• …and you don’t have to read their handwriting!
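
A sketch of the kind of support meant here (the keyword list and the fuzzy-match threshold are illustrative, not ABC parameters): display all answers to a part-question ordered by length and flag which expected keywords each contains, tolerating near-misses such as “interpretor”:

```python
import difflib

def keyword_hits(answer, keywords, fuzzy=0.8):
    """Which expected keywords appear in an answer, allowing approximate
    token matches (e.g. "interpretor" ~ "interpreter")."""
    tokens = [t.strip(".,;:!?") for t in answer.lower().split()]
    return {kw for kw in keywords
            if any(difflib.SequenceMatcher(None, kw, t).ratio() >= fuzzy for t in tokens)}

answers = ["working memory, rule memory, interpretor",
           "I am a fish. I am a fish. I am a fish.",
           "The knowledge, the rules that operate on the knowledge, and the interpreter"]
keywords = ["working", "rule", "interpreter"]        # marker-supplied, hypothetical
for a in sorted(answers, key=len):                   # display ordered by length
    print(f"{len(keyword_hits(a, keywords))}/{len(keywords)}  {a}")
```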

Page 18: Language Engineering for  Human-Computer Collaborative  Assessment


Clustering

• Each answer (including the model answer) abstracted into a numerical form to enable measurement of similarity with other answers

• Similarity of each answer with each other answer measured and stored in an answer-by-answer similarity matrix

• Clustering algorithm applied to the matrix

Page 19: Language Engineering for  Human-Computer Collaborative  Assessment


Page 20: Language Engineering for  Human-Computer Collaborative  Assessment


Abstraction

• Vector Space Model

• Vectors refined by:

Spelling correction

Stoplist removal

Stemming

Term weighting
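
A minimal sketch of this abstraction in plain Python (not the ABC implementation): a toy stoplist and a crude suffix-stripping stemmer stand in for the real refinements, spelling correction is omitted, and term weighting is shown as inverse document frequency, one common choice:

```python
import math
from collections import Counter

STOPLIST = {"the", "a", "an", "of", "and", "is", "are", "which", "that"}   # toy stoplist

def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer
    (e.g. change / changes / changing all reduce to "chang")."""
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_vector(answer):
    """Bag-of-words term-frequency vector after stoplist removal and stemming."""
    tokens = (w.strip(".,;:!?()").lower() for w in answer.split())
    return Counter(stem(t) for t in tokens if t and t not in STOPLIST)

def idf_weights(vectors):
    """Term weighting: inverse document frequency across the whole answer set."""
    df = Counter(term for v in vectors for term in v)
    return {term: math.log(len(vectors) / df[term]) for term in df}
```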

Page 21: Language Engineering for  Human-Computer Collaborative  Assessment


Page 22: Language Engineering for  Human-Computer Collaborative  Assessment


Similarity measurement

• Cosine distance (standard)

• Stored in an answer-by-answer similarity matrix

• Generic: can handle many other question types, e.g. diagrams
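
Continuing the sketch above, the standard cosine measure (computed here as a similarity rather than a distance) is taken between every pair of weighted vectors and stored as a full answer-by-answer matrix:

```python
def cosine(u, v, w):
    """Cosine similarity of two term-frequency vectors under term weights w."""
    def norm(x):
        return math.sqrt(sum((x[t] * w.get(t, 1.0)) ** 2 for t in x))
    dot = sum(u[t] * v[t] * w.get(t, 1.0) ** 2 for t in u.keys() & v.keys())
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(answers, model_answer):
    """Answer-by-answer similarity matrix; the model answer is the last row/column."""
    vectors = [to_vector(a) for a in answers] + [to_vector(model_answer)]
    w = idf_weights(vectors)
    return [[cosine(u, v, w) for v in vectors] for u in vectors]
```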

Page 23: Language Engineering for  Human-Computer Collaborative  Assessment


Clustering algorithm

• Agglomerative Hierarchical Clustering

• Number of clusters not known in advance

• “Average Within-Cluster Similarity” (AWCS) gives a clue to reliability as a basis for marking
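
A greedy sketch of the idea (not the ABC algorithm itself, and the default threshold is illustrative): keep merging clusters over the similarity matrix for as long as the average within-cluster similarity (AWCS) of the merged cluster stays above a threshold; whatever is left as a singleton is an outlier. Relaxing the threshold, as in Example 3 below, gives fewer outliers and larger clusters.

```python
def cluster(sim, threshold=0.95):
    """Agglomerative clustering over an answer-by-answer similarity matrix.
    Merging continues while the merged cluster's average pairwise similarity
    (AWCS) stays above `threshold`; singletons are returned as outliers."""
    clusters = [[i] for i in range(len(sim))]

    def awcs(members):
        pairs = [(a, b) for i, a in enumerate(members) for b in members[i + 1:]]
        return sum(sim[a][b] for a, b in pairs) / len(pairs) if pairs else 1.0

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = awcs(clusters[i] + clusters[j])
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return ([c for c in clusters if len(c) > 1],      # clusters proper
            [c[0] for c in clusters if len(c) == 1])  # outliers
```

Put together with the earlier sketches, a call such as cluster(similarity_matrix(answers, model_answer), threshold=0.90) would produce output of the kind shown in the examples that follow, with each cluster then inspected and marked once by the human marker.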

Page 24: Language Engineering for  Human-Computer Collaborative  Assessment


Page 25: Language Engineering for  Human-Computer Collaborative  Assessment


Example 1: Production systems

• “What are the three components of a production system?”

• 151 student answers

Cluster 1 (50): working memory, rule memory, interpreter

Cluster 2 (8): 1. working memory, 2. rule memory, 3. Interpreter

Cluster 3 (6): working memory, rule memory, interpreter (inference engine)

Cluster 4 (5): working memory, rule memory, interpretor

Cluster 5 (3): working memory, rule memory, interpretter

Cluster 6 (3): include the phrase “the three components”

Cluster 7 (3): working memory, rule memory, inference engine

Outliers: 65

Page 26: Language Engineering for  Human-Computer Collaborative  Assessment


Outliers

• Unique mis-spellings:

Rule memory, working memory and interperetor

• Correct answers uniquely expressed:

Working memory - contains state; Rule memory - contains rules; Interpreter - decides which to fire

• Unique wrong answers:

I am a fish. I am a fish. I am a fish.

Page 27: Language Engineering for  Human-Computer Collaborative  Assessment


Example 2: Iron deficiency

• “Name one deficiency which would give rise to a microcytic anaemia.”

• 279 student answers

• 17 clusters, 79 outliers

Cluster 1 (132): iron deficiency and minor variants (collapsed by pre-processing)

Cluster 2 (15): iron

Cluster 3 (8): iron deficiency, diet

Cluster 4 (7): iron deficiency anemia

&c

Page 28: Language Engineering for  Human-Computer Collaborative  Assessment


• “What single measurement would you make to confirm that an individual is anaemic?”

• 22 clusters, 85 outliers

Cluster 1 (67): haemoglobin concentration

Cluster 2 (42): red blood cell count

Cluster 3 (15): packed cell volume

Cluster 4 (13): minor variants on haemoglobin concentration in the blood

&c

Page 29: Language Engineering for  Human-Computer Collaborative  Assessment


Example 3: The frame problem

• “What, in Artificial Intelligence, is the Frame Problem?”

• 104 student answers

• AWCS relaxed to 0.90, giving 12 clusters, 39 outliers

Cluster 1 (19): real world, chang- (change, changes, changing, &c)

Cluster 2 (14): world, chang-

Cluster 3 (8): frame

Cluster 4 (6): exceptions, inheritance

Cluster 5 (4): chang-, repres-

&c

Page 30: Language Engineering for  Human-Computer Collaborative  Assessment


Benefits

• Marking time reduced by factor of 2-3 compared to paper scripts

• Can include MCQs where appropriate - they’re not always bad

• Answers genuinely anonymous

• Consistency likely to improve

• Clerical checking eliminated

• Detailed analysis of results possible – good for “drilling down”.

• Lots of data generated for further research

Page 31: Language Engineering for  Human-Computer Collaborative  Assessment


Collaboration: software development

• Initial system did little more than replace paper exam books

• Gradually extending real (diagnostic, formative, and summative) use

• Gradually extending functionality

• Priorities for development influenced by users and would-be users, e.g. current top priority is formatted maths…

• …and by real student answers.

Page 32: Language Engineering for  Human-Computer Collaborative  Assessment


Conclusions: assessment

In decreasing order of confidence:

• HCCA, even with very simple tools, is very effective, at least in some cases.

• There are many issues of usability, procedures, education…

• Simple keyword-based answers are easy for HCCA but hard for machines alone.

• Discursive / subjective answers probably require a range of NLE techniques.

Page 33: Language Engineering for  Human-Computer Collaborative  Assessment


Conclusions: NLE

• Applications of NLE don’t have to be “all or nothing” …

• … which is just as well, because even “simple” real data is complicated.

• HCC gets the best from both machine and user…

• … and means that very simple techniques can be Really Useful.